Scene-Level Geographic Image Classification Based on a Covariance Descriptor Using Supervised Collaborative Kernel Coding

Scene-level geographic image classification has been a very challenging problem and has become a research focus in recent years. This paper develops a supervised collaborative kernel coding method based on a covariance descriptor (covd) for scene-level geographic image classification. First, covd is introduced in the feature extraction process and, then, is transformed to a Euclidean feature by a supervised collaborative kernel coding model. Furthermore, we develop an iterative optimization framework to solve this model. Comprehensive evaluations on public high-resolution aerial image dataset and comparisons with state-of-the-art methods show the superiority and effectiveness of our approach.


Introduction
Nowadays, high spatial resolution remote sensing images are easily acquired thanks to the rapid development of satellite and remote sensing technology, which has endowed us with the opportunity to interpret, analyze and understand the image. As a fundamental research area of remote sensing image analysis, scene-level geographic image classification is of great importance for land use and land cover (LULC) image classification [1][2][3], semantic interpretations of images [4], geographic image retrieval [5][6][7] and forest type mapping [8], which has drawn increasing attention and scholars' study [1][2][3]5,[9][10][11][12][13]. Figure 1 shows geographic images whose spatial resolution is 30 m, 1 m and 0.3 m, respectively.
However, finding an efficient representation of the scene-level image is a challenging problem. The bag of visual words (BOVW) model [14] is one of the most successful models. The works in [2,5] detailed the application of BOVW on the scene-level image classification task. As is illustrated in [2,5], BOVW can represent the image by compact representation through a visual word counts histogram and provides further invariance to the image transformations. However, the tradeoff between invariance and discriminability is controlled by the visual dictionary size. What is more, BOVW disregards the information about the spatial layout of the features, which is of great importance to scene-level image classification [2,15,16]. In order to overcome this shortcoming, one successful extension of BOVW is spatial pyramid matching (SPM) [16], which partitions the image into increasing finer sub-images and computes histograms of local features from each sub-image. Although SPM is a computationally-efficient extension of BOVW and shows superior performance, it does not consider the relative spatial arrangement and only characterizes the absolute location of the visual words in an image. From this point of view, SPM also limits the descriptive ability of the scene-level geographic image representation. Hence, two new image representation models, which are termed spatial co-occurrence kernel (SCK) [1] and spatial pyramid co-occurrence kernel (SPCK) [2], are proposed by Yang and Newsam. What is more, in order to capture the absolute and relative spatial relationships of BOVW, a pyramid of spatial relations (PSR) model is developed by Chen and Tian. The work in [17] points out that the computational complexities of SCK and SPCK are high because of the need to use nonlinear Mercer kernels and developed a linear form of the SCK. Besides, [10] proposed an unsupervised feature learning method, in which the new sparse representations of the feature descriptors are generated by the low-level feature descriptors. On the other hand, the covariance descriptor (covd) proposed by by Tuzel [18] can be used for feature representation of the image, which has been extensively adopted in vast computer vision tasks, e.g., texture discrimination [18], visual saliency estimation [19], object detection [18,20] and object tracking [21]. Covd is a covariance matrix of different features, e.g., color, gradient and spatial location, and it holds certain rotation and scale invariance. However, how to model and compute covd still remains a key problem. We all know that covd lies in the Riemannian manifold, which is a non-Euclidean space. As a result, traditional mathematical modeling and computation in Euclidean space cannot be directly utilized, which results in a great challenge. In [22], a discriminative learning method is developed to formulate the classification problem on Riemannian space by covd, which presents a kernel function and a log-Euclidean distance metric to solve Riemannian-Euclidean transformation. In [23], a coding strategy is introduced, and the descriptor can be transformed into a new feature; and then, extreme learning machine (ELM) can be used for dynamic texture video classification. However, such a method separately optimizes the reconstruction error of the coding and the classification error of ELM, and the design stage of coding and the classifier are totally independent. In order to solve this problem, a supervised collaborative kernel coding approach incorporating the linear classifier supervised term that can optimize both the reconstruction error and the linear classifier simultaneously is developed. There are three contributions as follows:

1.
A supervised collaborative kernel coding model, illustrated in Figure 2, is proposed. This model can not only transform the covd to a discriminative feature representation, but also can obtain the corresponding linear classifier.

2.
An iterative optimization framework is introduced to solve the supervised collaborative kernel coding model.

3.
Experiments on public high-resolution aerial image dataset validate that the proposed supervised collaborative kernel coding model derives a satisfying performance on the scene-level geographic image classification.
The paper is organized as follows: After a review of our proposed methodology in Section 2, Section 3 shows the iterative optimization approach. In Sections 4 and 5, we give the experiments and conclusions.  Figure 3 shows the overview of the proposed method, which consists of 3 stages, the pre-processing stage, coding stage and classification stage. In the pre-processing stage, covd is extracted as the initial feature representation of the scene-level geographic image. Then, in the coding stage, the supervised collaborative kernel coding strategy involving dictionary coefficients, the coding representation phase and the linear classification phase is presented. Finally, in the classification stage, based on the dictionary coefficients and learned linear classifier, a label vector can be simply derived through the linear classifier, the index corresponding to the largest value of which is the label of a testing scene-level geographic image.

Covariance Descriptor
Covd was first proposed by Tuzel et al. [18] as a compact descriptor. Formally, let {f k } k=1,··· ,d be a feature vector denoting the feature points of p-dimension as color, gradient filter response, etc. Then, a covd C of s × s dimensions of an image can be described as: where d and v denote the pixel number and the mean value, respectively.
The feature vector f is established using the image intensity of each channel, the norm of the first and second derivatives of intensity in the x and y directions. As for a geographic image, a feature vector f x,y = [c T R,x,y , c T G,x,y , c T B,x,y ] T of 15 dimensions is computed at each pixel (x, y), and here, , where I C and C ∈ {R, G, B} denote the the C channel intensity image and the channel of the color, respectively.
The work in [18] points out that covd has at least three characteristics: (1) it is enough to describe the image of different poses and views; (2) multiple features can be fused in a natural way through covd, the diagonal and non-diagonal elements of which describe the variance and correlations of different features, respectively; (3) comparing to other descriptors, such as raw values and the histogram, covd is low-dimensional, and it has only s 2 +s 2 different values due to symmetry. Nevertheless, covd is a symmetric positive definite matrix. The key issue for a symmetric positive definite matrix is how to model and compute it. As is illustrated in Figure 4, covd lies in a Riemannian manifold [24], which is not a Euclidean space. Accordingly, the mathematical modeling of covd is not the same as what we usually do in the Euclidean space. Here, we adopt the idea of Ruiping Wang [22] and compute the distance of two covds C 1 and C 2 using log-Euclidean distance [25,26]: where logm is the logarithm computation of the matrix and || · || F denotes the Frobenius norm. Moreover, there is a tricky problem regarding how to use covd in the geographic image classification. It is a fact that covd lies in a non-Euclidean space; thus, the traditional linear classifier based on Euclidean space cannot be directly utilized. Therefore, in the following, how to solve this problem is the theme.

Supervised Collaborative Kernel Coding Model
As is shown in Figure 5, here, we propose a supervised collaborative kernel coding model, which consists of two jointly working components: (1) the dictionary learning and feature representation phase; and (2) the linear classification phase. First, the linear classifier is incorporated into the dictionary learning and feature representation phase, making the resulting coding vector A more discriminative. Then, based on the coding vector A, the linear classifier W is derived. In this way, the objectives function in each phase are combined into a unified optimization framework, through which a collaborative coding vector and the corresponding linear classifier can be simultaneously obtained. At last, based on the dictionary coefficients V, testing signal s i is transformed into a feature vector, which is used for linear classification directly.
As for covd computation, the Gaussian kernel is chosen as the mapping function for its superior performance in vast computer vision tasks [27]: where the decay parameter β is empirically set as 0.02 and κ(x i , x j ) is the Gaussian kernel between two samples x i and x j . The aim of dictionary learning is to empirically learn a dictionary adapted to the training sample set; therefore, we need to determine some atoms d 1 , · · · , d K ∈ P to represent the training samples, where K is the dictionary size and K < N. Let Φ(X) = [Φ(x) 1 , · · · , Φ(x) N ] ∈ R m×K , and the kernel dictionary learning process can be formulated as: where A ∈ R K×N is the coding matrix and λ is the penalty parameter.
Thanks to the kernel trick [28,29], through the mapping function Φ(·), the problem on the Riemannian manifold can be transformed to a collaborative coding problem in the Euclidean space. Nevertheless, since the number of dictionary atoms d j may be infinite, there exists a new challenge to the dictionary learning process in such a formulation. Fortunately, [30,31] prove that the dictionary D can be represented as D = Φ(X)V, where V ∈ R N×K is a coefficient matrix. This indicates that the training samples can linearly represent the dictionary in the feature space. As a result, Equation (4) can be reformulated as: Such a formulation provides two significant advantages: (1) the dictionary learning process becomes searching the matrix V; (2) for any kernel function, this formulation reduces the dictionary learning process to linear problems. Now, we propose a novel objective function combining both the collaborative kernel coding phase and classification phase as: where ||Φ(X) − Φ(X)VA|| 2 2 and ||L − WA|| 2 2 denotes the reconstruction error and the linear classification error, respectively, and W represents the classifier parameters. η, λ and ρ are all penalty parameters.
The derived dictionary through this formulation can generate more discriminative codes A, which is of great importance to the performance of the classifier and also adaptive to the underlying structure of training samples. The resulting codes A are then directly used for classification.
For a testing sample s i , through Equation (7), the feature representation code z i is firstly computed with dictionary coefficients V. Then, in order to derive the label vector, we can use l i = Wz i . The index corresponding to the largest value of l i is the label of s i .

Optimization Algorithm
There are three variables as V, A and W in the objective function Equation (6). Here, an iterative optimization algorithm for each variable by fixing the other two is introduced. (Equation (6) is denoted as F (V, A, W), and the obtained variables from the k-th and (k + 1)-th iteration are denoted as the subscripts (k) and (k + 1), respectively, and k = 0, · · · , N − 1).
Step 1: Initialization. We randomly set coefficient matrix V 0 ∈ R N×K . Next, we compute the corresponding coding coefficient A by taking the derivative of A of Equation (6): where K(X, X) is an N × N square matrix of which the (i, j)-th element is κ(x i , x j ).
Step 2: Fixing A, taking the derivative of V: Additionally, the corresponding solution is: Step 3: Fixing V and A, taking the derivative of W, we can derive the optimal solution of W.
Step 4: Fixing V and W, and taking the derivative of A: Then, the optimal solution of A is: Step 5: Iteration from Step 2 to Step 4 until convergence.
A whole algorithm summary, which includes the above optimization procedures, is given in Algorithm 1, and the representative reconstruction error of the objective function is shown in Figure 6. In case of the optimal A, we can derive the optimal solution of z based on Equation (7) as: where Algorithm 1. The Iteration Optimization Procedure. Input:  (10) 4. Fixing V (k+1) and A (k) , update W (k+1) according to Equation (12) 5. Fixing V (k+1) and W (k+1) , update A (k+1) according to Equation (14) 6. end while Figure 6. The representative reconstruction error of the objective function.

Dataset and Experiment Setup
In this section we demonstrate the application of our method in the classification experiments using a publicly available dataset [1], which includes twenty one scene categories with one hundred images of each class. This dataset corresponds to various land LULC types, which is shown in Figure 7. For each category, it is randomly partitioned into five subsets, and each subset contains twenty samples. During the experiments, one subset is used for testing, and the remaining four subsets are used for training. Finally, we report the average classification accuracy.

Parameter Analysis
Equation (6) has four parameters, λ, η, ρ and dictionary size K, which need to be tuned. In order to determine their values, n-fold cross-validation is adopted. Each parameter is investigated by fixing the other parameters. It is noted that the initialization of K is 210. Figure 8 shows the classification accuracy of each tuned parameter. It is easy to find that our approach obtains the best performance (83.81%) when λ = 0.001, η = 1, ρ = 0.1 (or 1).

Experiment Results and Comparison
The following three baseline methods are designed for comparison:

1.
This method isolates the feature representation and classification process, which means that A 0 is used as the feature representation and that W 0 is used as the linear classifier.

2.
This method is the same as the proposed method, except that the covd is established based on image intensities and the magnitude of the first and second gradients. Namely, This method is the same as baseline Method 1, except that the covd is a 9 × 9 matrix, which is the same as baseline Method 2. Figure 9 shows the classification accuracy versus dictionary size K. From this figure, we can find some interesting results:

1.
Our approach is always better than the three baseline methods, and when K = 357, our approach obtains the best performance (87.14%).

2.
Comparing to baseline Method 1, our proposed method obtains a higher classification accuracy, which indicates the effectiveness of the optimization algorithm.

3.
Comparing the proposed method to baseline Method 2, the only difference is the covd. The former uses a 15 × 15 matrix, which is a covariance format of intensity of each channel and the norms of the first and second gradients of intensities, while the covd of the latter is a 9 × 9 covariance format of the intensity of each channel and the magnitude of the first and second gradients. It is clear that both covds are not rotationally invariant, especially that the former covd is not direction invariant. However, the proposed method obtains a higher classification accuracy. This may indicate that the covariance format offsets the rotations to some extent.   Figure 10 shows the confusion matrices of the baseline methods and our approach, respectively. The classification accuracy of fifteen categories is more than 80%, and eleven categories are more than 90%. Nevertheless, the classification accuracy of three categories, buildings, dense residential and intersection, is less than 70%.
In order to analyze the proposed method, Figure 11 lists some representative misclassification samples of the proposed method. Some misclassification couples, such as intersection/overpass, overpass/runway and river/forest, shown in Figure 11 are hard to identify, even with our own eyes. Besides, we report the classification accuracies of both baseline methods and our method over groups rotated five times in Table 1. Then, the comparison with the classical approaches [1,2], BOVW, SPM, SCK, BOVW + SCK, color histogram, such as RGB, HLS and CIE Lab, texture, SPCK, BOVW + SPCK and SPCK + SPM, is shown in Figure 12.  Figure 12. The comparison of our method with the state-of-the-art performance reported in the literature on the dataset UCMERCED. BOVW, bag of visual words; SPM, spatial pyramid matching; SCK, spatial co-occurrence kernel; SPCK, spatial pyramid co-occurrence kernel.

Conclusions
This paper proposes a novel supervised collaborative kernel coding model based on covd for scene-level geographic image classification. Since covd lies in non-Euclidean space, the linear classifier, which is based on Euclidean distance, cannot be utilized. Additionally, our main contribution is explicitly integrating the discriminative feature coding and a linear classifier into the objective function. Moreover, the solution to the new objective function is efficiently achieved by simply employing the optimization algorithm. Experiments implemented on the UCMERCED dataset show the effectiveness of our approach.