Remote Sensing Image Scene Classification Using Multi-Scale Completed Local Binary Patterns and Fisher Vectors

An effective remote sensing image scene classification approach using patch-based multi-scale completed local binary pattern (MS-CLBP) features and a Fisher vector (FV) is proposed. The approach extracts a set of local patch descriptors by partitioning an image and its multi-scale versions into dense patches and using the CLBP descriptor to characterize local rotation invariant texture information. Then, Fisher vector encoding is used to encode the local patch descriptors (i.e., patch-based CLBP features) into a discriminative representation. To improve the discriminative power of feature representation, multiple sets of parameters are used for CLBP to generate multiple FVs that are concatenated as the final representation for an image. A kernel-based extreme learning machine (KELM) is then employed for classification. The proposed method is extensively evaluated on two public benchmark remote sensing image datasets (i.e., the 21-class land-use dataset and the 19-class satellite scene dataset) and leads to superior classification performance (93.00% for the 21-class dataset with an improvement of approximately 3% when compared with the state-of-the-art MS-CLBP and 94.32% for the 19-class dataset with an improvement of approximately 1%).


Introduction
Remote sensing is an effective tool for Earth observation, which has been widely applied in surveying land-use and land-cover classifications and monitoring their dynamic changes.With the improvement of spatial resolution, remote-sensing images present more detailed information such as spatial arrangement information and textural structures, which are of great help in recognizing different land-use and land-cover scene categories.The goal of image scene classification is to recognize the semantic categories of a given image based on some priori knowledge.Due to intra-class variations and wide range of illumination and scale changes, scene classification of high-resolution remote sensing images remains a challenging problem.
The last decade saw considerable efforts to employ computer vision techniques to classify aerial or satellite image scenes.The bag-of-visual-words (BOVW) model [1], which is one of the most popular approaches in image analysis and classification applications, provides an efficient approach to solve the problem of scene classification.The BOVW model, derived from document classification in text analysis, represents an image as a histogram of frequencies of a set of visual words by mapping the discriminative power of the feature representation, multiple sets of parameters for the CLBP operator (i.e., MS-CLBP) were utilized to generate multiple FVs.The final representation for an image is achieved by concatenating all the FVs.For classification, the kernel-based extreme learning machine (KELM) [20] is adopted for its efficient computation and good classification performance.
There are two main contributions from this work.First, a local feature representation method using patch-based MS-CLBP features and FV is proposed.The MS-CLBP operator is applied to the partitioned dense regions to extract a set of local patch descriptors, and then the Fisher kernel representation is used to encode the local descriptors into a discriminative representation of remote sensing images.Second, the two implementations of MS-CLBP are combined into a unified framework to build a more powerful feature representation.The proposed local feature representation method is evaluated using two public benchmark remote sensing image datasets.The experimental results verify the effectiveness of our proposed method as compared to state-of-the-art algorithms.
The remainder of the paper is organized as follows.Section 2 presents the related works including CLBP and the Fisher vector.Section 3 describes two implementations of MS-CLBP, patch-based MS-CLBP feature extraction, and the details of the proposed feature representation method.Section 4 provides the experimental results.Finally, Section 5 concludes the paper.

Completed Local Binary Patterns
Local binary patterns (LBP) [21,22] are an effective measure of spatial structure information of local image texture.Consider a center pixel and its gray value, t c .Its neighboring pixels are equally spaced on a circle of radius r with the center at location t c .If the coordinates of t c are p0, 0q and m neighbors tt i u m´1 i"0 are considered, the coordinates of t i are denoted as p´rsinp2πi{mq, rcosp2πi{mqq.Then, the LBP is calculated by thresholding the neighbors tt i u m´1 i"0 with the center pixel t c to generate an m-bit binary number.The resulting LBP for t c in decimal number can be expressed as follows: LBP m,r pt c q " m´1 ÿ i"0 s pt i ´tc q 2 i " m´1 ÿ i"0 s pd i q 2 i , spxq " # 1, x ě 0 0, x ă 0 (1) where d i " pt i ´tc q represents the difference between the center pixel and each neighbor, which characterizes the spatial local structure at the center location.Further, the resulted d i is robust to illumination changes and they are more efficient than the original image in pattern classification due to the fact that the central gray level t c is removed.The difference vector d i can be further decomposed into two components: the signs and magnitudes (absolute values of d i , i.e., |d i |).However, the original LBP only uses the sign information of d i while ignoring the magnitude information.In the improved CLBP [18], the signs and magnitudes are complementary, from which the difference vector d i can be perfectly reconstructed.Figure 1 illustrates an example of the sign and magnitude components of the CLBP extracted from a sample block, where Figure 1a-d denote original 3 ˆ3 local structure, difference vector, sign vector and magnitude vector, respectively.Note that "0" is coded as "´1" in CLBP (as seen in Figure 1c).Two operators, CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M), are used to encode these two components.CLBP_S is equivalent to the traditional LBP operator while the CLBP_M operator can be expressed as, CLBP_M m,r " where c is a threshold that is set to the mean value of |d i |.Using Equations ( 1) and ( 2), two binary strings can be produced and denoted as CLBP_S and CLBP_M codes, respectively.Two ways to combine the CLBP_S and CLBP_M codes are presented in [18].Here, the first way (concatenation) is used, in which the histograms of the CLBP_S and CLBP_M codes are calculated separately, and then the two histograms are concatenated.Note that there is also the CLBP-Center part which codes the values of the center pixels in the original CLBP.Here, only the CLBP_S and CLBP_M operators are considered for computational efficiency.
where c is a threshold that is set to the mean value of i d .Using Equations ( 1) and ( 2), two binary strings can be produced and denoted as CLBP_S and CLBP_M codes, respectively.Two ways to combine the CLBP_S and CLBP_M codes are presented in [18].Here, the first way (concatenation) is used, in which the histograms of the CLBP_S and CLBP_M codes are calculated separately, and then the two histograms are concatenated.Note that there is also the CLBP-Center part which codes the values of the center pixels in the original CLBP.Here, only the CLBP_S and CLBP_M operators are considered for computational efficiency.

Fisher Vector
After local feature extraction (especially for patch-based feature extraction), the popular BOVW model is usually employed to encode features into histograms.However, the BOVW model has an intrinsic limitation, namely the computational cost in assignment of local features to visual words, which scales as the product of the number of visual words, the number of regions and the local feature dimensionality [23].Several extensions to the basic BOVW model to build compact vocabularies have been proposed.The most appealing one is the Fisher kernel image representation [19,24], which uses high-dimensional gradient representation to represent an image.Due to informative representations with compact vocabularies, this representation contains more information than a simple histogram representation.
An FV is a special case of Fisher kernel construction.Let be the set of local patch descriptors extracted from an image.A Gaussian mixture model (GMM) is trained on the training images using Maximum Likelihood (ML) estimation [25,26].Let denote the probability density function of the GMM with parameters { , , , 1... } , where is the number of components.i ω , i μ and i Σ are the mixture weight, mean vector, and covariance matrix of the th i Gaussian component, respectively.The image can be characterized by the gradient of the log-likelihood of the data on the model: (3)  where c is a threshold that is set to the mean value of i d .Using Equations ( 1) and ( 2), two binary strings can be produced and denoted as CLBP_S and CLBP_M codes, respectively.Two ways to combine the CLBP_S and CLBP_M codes are presented in [18].Here, the first way (concatenation) is used, in which the histograms of the CLBP_S and CLBP_M codes are calculated separately, and then the two histograms are concatenated.Note that there is also the CLBP-Center part which codes the values of the center pixels in the original CLBP.Here, only the CLBP_S and CLBP_M operators are considered for computational efficiency.

Fisher Vector
After local feature extraction (especially for patch-based feature extraction), the popular BOVW model is usually employed to encode features into histograms.However, the BOVW model has an intrinsic limitation, namely the computational cost in assignment of local features to visual words, which scales as the product of the number of visual words, the number of regions and the local feature dimensionality [23].Several extensions to the basic BOVW model to build compact vocabularies have been proposed.The most appealing one is the Fisher kernel image representation [19,24], which uses high-dimensional gradient representation to represent an image.Due to informative representations with compact vocabularies, this representation contains more information than a simple histogram representation.
An FV is a special case of Fisher kernel construction.Let { , 1 ... } t X x t T = = be the set of local patch descriptors extracted from an image.A Gaussian mixture model (GMM) is trained on the training images using Maximum Likelihood (ML) estimation [25,26].Let denote the probability density function of the GMM with parameters { , , , 1... } , where is the number of components.i ω , i μ and i Σ are the mixture weight, mean vector, and covariance matrix of the th i Gaussian component, respectively.The image can be characterized by the gradient of the log-likelihood of the data on the model: (3)

Fisher Vector
After local feature extraction (especially for patch-based feature extraction), the popular BOVW model is usually employed to encode features into histograms.However, the BOVW model has an intrinsic limitation, namely the computational cost in assignment of local features to visual words, which scales as the product of the number of visual words, the number of regions and the local feature dimensionality [23].Several extensions to the basic BOVW model to build compact vocabularies have been proposed.The most appealing one is the Fisher kernel image representation [19,24], which uses high-dimensional gradient representation to represent an image.Due to informative representations with compact vocabularies, this representation contains more information than a simple histogram representation.
An FV is a special case of Fisher kernel construction.Let X " tx t , t " 1 ... Tu be the set of local patch descriptors extracted from an image.A Gaussian mixture model (GMM) is trained on the training images using Maximum Likelihood (ML) estimation [25,26].Let P denote the probability density function of the GMM with parameters λ " tω i , µ i , Σ i , i " 1...Ku, where K is the number of components.ω i , µ i and Σ i are the mixture weight, mean vector, and covariance matrix of the i th Gaussian component, respectively.The image can be characterized by the gradient of the log-likelihood of the data on the model: The gradient describes the direction along which parameters are to be adjusted to best fit the data.Under an independence assumption, the covariance matrices are diagonal, i.e., Σ i " diag `σ2 i ˘.
Then following [27], LpX|λq " logPpX|λq is written as, The probability density function of x t generated by the GMM is Let γ t piq be the occupancy probability, i.e., the probability of descriptor x t generated by the i-th Gaussian.
with the Bayes formula mathematical derivations providing the following results, where d denotes the d th dimension of a vector.The diagonal closed-form approximation in [27] is used to normalize the gradient vector by multiplying the square-root of the inverse of the Fisher information Thus, the normalized partial derivatives are The final gradient vector (i.e., FV) is just a concatenation of all the partial derivative vectors.Therefore, the dimensionality of FV is p2D `1q ˆK, where D denotes the size of the local descriptors.

Proposed Feature Representation Method
Inspired by the success of CLBP and FV in computer vision applications, we propose an effective image representation approach for remote sensing image scene classification based on patch-based MS-CLBP features and FV.The patch-based MS-CLBP is applied as the local patch descriptors and then the FV is chosen as the encoding strategy to generate a high-dimensional representation of an image.

Two Implementations of Multi-Scale Completed Local Binary Patterns
CLBP features computed from a single-scale may not be able to detect the dominant texture features from an image.A possible solution is to characterize the image texture in multiple resolutions, i.e., MS-CLBP.There are two implementations for the MS-CLBP descriptor [17].
In the first implementation, the radius of a circle r is altered to change the spatial resolution.The multi-scale analysis is accomplished by combining the information provided by multiple operators of varying pm, rq.For simplicity, the number of neighbors is fixed to m and different values of r are tuned to achieve the optimal combination.An example of a 3-scale (three r values) CLBP operator is illustrated in Figure 3.The CLBP_S and CLBP_M histogram features extracted from each scale are concatenated to form an MS-CLBP representation.One disadvantage of this multi-scale analysis implementation is that the computational complexity increases due to multiple resolutions.
descriptors and then the FV is chosen as the encoding strategy to generate a high-dimensional representation of an image.

Two Implementations of Multi-Scale Completed Local Binary Patterns
CLBP features computed from a single-scale may not be able to detect the dominant texture features from an image.A possible solution is to characterize the image texture in multiple resolutions, i.e., MS-CLBP.There are two implementations for the MS-CLBP descriptor [17].
In the first implementation, the radius of a circle is altered to change the spatial resolution.The multi-scale analysis is accomplished by combining the information provided by multiple operators of varying ( , ) m r .For simplicity, the number of neighbors is fixed to m and different values of r are tuned to achieve the optimal combination.An example of a 3-scale (three r values) CLBP operator is illustrated in Figure 3.The CLBP_S and CLBP_M histogram features extracted from each scale are concatenated to form an MS-CLBP representation.One disadvantage of this multi-scale analysis implementation is that the computational complexity increases due to multiple resolutions.In the second implementation, the original image is down-sampled using the bicubic interpolation to obtain multiple images at different scales.The value of scale is between 0 and 1 (here, 1 denotes the original image).Then, the CLBP_S and CLBP_M operators of fixed radius and the number of neighbors are applied to the multiple-scale images.The CLBP_S and CLBP_M histogram features extracted from each scale image are concatenated to form an MS-CLBP representation.An example of the second implementation of the MS-CLBP descriptor is shown in Figure 4.

Patch-Based MS-CLBP Feature Extraction
Given an image, the CLBP [18] operator with a parameter set ( , ) m r is applied to generate two CLBP coded images with one corresponding to the sign component (i.e., CLBP_S coded image) In the second implementation, the original image is down-sampled using the bicubic interpolation to obtain multiple images at different scales.The value of scale is between 0 and 1 (here, 1 denotes the original image).Then, the CLBP_S and CLBP_M operators of fixed radius and the number of neighbors are applied to the multiple-scale images.The CLBP_S and CLBP_M histogram features extracted from each scale image are concatenated to form an MS-CLBP representation.An example of the second implementation of the MS-CLBP descriptor is shown in Figure 4.
descriptors and then the FV is chosen as the encoding strategy to generate a high-dimensional representation of an image.

Two Implementations of Multi-Scale Completed Local Binary Patterns
CLBP features computed from a single-scale may not be able to detect the dominant texture features from an image.A possible solution is to characterize the image texture in multiple resolutions, i.e., MS-CLBP.There are two implementations for the MS-CLBP descriptor [17].
In the first implementation, the radius of a circle is altered to change the spatial resolution.The multi-scale analysis is accomplished by combining the information provided by multiple operators of varying ( , ) m r .For simplicity, the number of neighbors is fixed to m and different values of r are tuned to achieve the optimal combination.An example of a 3-scale (three r values) CLBP operator is illustrated in Figure 3.The CLBP_S and CLBP_M histogram features extracted from each scale are concatenated to form an MS-CLBP representation.One disadvantage of this multi-scale analysis implementation is that the computational complexity increases due to multiple resolutions.In the second implementation, the original image is down-sampled using the bicubic interpolation to obtain multiple images at different scales.The value of scale is between 0 and 1 (here, 1 denotes the original image).Then, the CLBP_S and CLBP_M operators of fixed radius and the number of neighbors are applied to the multiple-scale images.The CLBP_S and CLBP_M histogram features extracted from each scale image are concatenated to form an MS-CLBP representation.An example of the second implementation of the MS-CLBP descriptor is shown in Figure 4.

Patch-Based MS-CLBP Feature Extraction
Given an image, the CLBP [18] operator with a parameter set ( , ) m r is applied to generate two CLBP coded images with one corresponding to the sign component (i.e., CLBP_S coded image)

Patch-Based MS-CLBP Feature Extraction
Given an image, the CLBP [18] operator with a parameter set pm, rq is applied to generate two CLBP coded images with one corresponding to the sign component (i.e., CLBP_S coded image) and the other the magnitude component (i.e., CLBP_M coded image).Two complementary components of CLBP (CLBP_S and CLBP_M) can capture the spatial patterns and contrast of local image texture, such as edges and corners.Then, the CLBP coded images are partitioned into B ˆB overlapped patches in an image grid.For simplicity, the overlap between two patches is half of the patch size (i.e., B{2) in both horizontal and vertical directions.To incorporate spatial structures of an image at different scales (or sizes) and create more patch descriptors, here the second implementation of MS-CLBP is employed by resizing the original image to different scales (e.g., 1{2 and 1{3 of the original image).Specifically, the CLBP operator with the same parameter set is applied to the multi-scale images to generate patch-based CLBP histogram features.For patch i, two occurrence histograms (i.e., the nonparametric statistical estimate) are computed from the sign component (CLBP_S) and the magnitude component (CLBP_M).A histogram feature vector denoted by h i is formed by concatenating the two histograms.If M patches are extracted from the multi-scale images, a feature matrix denoted by H " rh 1 , h 2 , ..., h M s is generated to represent the original image.Each column of the matrix H is a histogram feature vector for a patch.The proposed patch-based CLBP feature extraction method is illustrated in Figure 5.
and the other the magnitude component (i.e., CLBP_M coded image).Two complementary components of CLBP (CLBP_S and CLBP_M) can capture the spatial patterns and contrast of local image texture, such as edges and corners.Then, the CLBP coded images are partitioned into overlapped patches in an image grid.For simplicity, the overlap between two patches is half of the patch size (i.e., 2 B ) in both horizontal and vertical directions.To incorporate spatial structures of an image at different scales (or sizes) and create more patch descriptors, here the second implementation of MS-CLBP is employed by resizing the original image to different scales (e.g., 1 2 and 1 3 of the original image).Specifically, the CLBP operator with the same parameter set is applied to the multi-scale images to generate patch-based CLBP histogram features.For patch i , two occurrence histograms (i.e., the nonparametric statistical estimate) are computed from the sign component (CLBP_S) and the magnitude component (CLBP_M).

A Fisher Kernel Representation
Fisher kernel representation [19] is an effective patch aggregation mechanism to characterize a sample of low-level features, and it exhibits superior performance over the BOVW model.Therefore, the Fisher kernel representation is employed to encode the dense local patch descriptors.As noted in [21], LBP features computed from a single scale may not be able to represent intrinsic texture features.Therefore, different parameter sets pm, rq are utilized for the CLBP operator to achieve the first implementation of the MS-CLBP as described in [17].Specifically, the number of neighbors (m) is fixed and multiple radii (r) are used in the patch-based CLBP feature extraction as shown in Figure 5.If q parameter sets (i.e., pm, r 1 q, pm, r 2 q, ..., pm, r q q ( ) are considered, a set of q feature matrices denoted by !H pm,r 1 q , H pm,r 2 q , ..., H pm,r q q ) can be obtained for an image.It is worth noting that the proposed patch-based MS-CLBP feature extraction effectively unifies the two implementations of the MS-CLBP [17].

A Fisher Kernel Representation
Fisher kernel representation [19] is an effective patch aggregation mechanism to characterize a sample of low-level features, and it exhibits superior performance over the BOVW model.Therefore, the Fisher kernel representation is employed to encode the dense local patch descriptors.
Given N T training images with N T feature matrices, ) representing the local patch descriptors (i.e., patch-based CLBP features) of each image are obtained using the feature extraction method illustrated in Figure 5.
Since q parameter sets (i.e., pm, r 1 q, pm, r 2 q, ..., pm, r q q ( ) are employed for the CLBP operator, each image yields q feature matrices denoted by !H rjs pm,r 1 q , H rjs pm,r 2 q , ..., H rjs pm,r q q ) , where j P r1, 2, ..., N T s.For each CLBP parameter set, the corresponding feature matrices of the training data are used to estimate the GMM parameters via the Expectation-Maximization (EM) algorithm.Therefore, for q CLBP parameter sets, q GMMs are created.
After estimating the GMM parameters, q FVs are obtained for an image.Then, the q FVs are simply concatenated as the final feature representation.Figure 6 shows the detailed procedure for generating FVs.As illustrated in Figure 6, the stacked FVs (f) from the q CLBP parameter sets serve as the final feature representation of an image before being fed into a classifier.
GMMs are created.After estimating the GMM parameters, q FVs are obtained for an image.Then, the FVs are simply concatenated as the final feature representation.Figure 6 shows the detailed procedure for generating FVs.As illustrated in Figure 6, the stacked FVs ( ) from the q CLBP parameter sets serve as the final feature representation of an image before being fed into a classifier.

Experiments
Two standard public domain datasets are used to demonstrate the effectiveness of the proposed image representation method for remote sensing land-use scene classification.In the experiments, KELM with a radial basis function (RBF) kernel is employed for classification due to its generally excellent classification performance and low computational cost.The classification performance of the proposed method is compared with the state-of-the-art in the literature.

Experimental Data and Setup
The first dataset is the well-known UC-Merced land-use dataset [28].It is the first public ground truth land-use scene image dataset that consists of 21 land-use classes and each class contains 100 images with a size of 256 × 256 pixels.The images were manually extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map.This is a challenging dataset due to a variety of spatial patterns in those 21 classes.Sample images of each land-use class are shown in Figure 7.To facilitate a fair comparison, the same experimental setting reported in [28] is followed.Five-fold cross-validation is performed, in which the dataset is randomly partitioned into five equal subsets.There are 20 images from each land-use class in a subset.Four subsets are used for training and the remaining subset for testing.The classification accuracy is the average over the five cross-validation evaluations.

Experiments
Two standard public domain datasets are used to demonstrate the effectiveness of the proposed image representation method for remote sensing land-use scene classification.In the experiments, KELM with a radial basis function (RBF) kernel is employed for classification due to its generally excellent classification performance and low computational cost.The classification performance of the proposed method is compared with the state-of-the-art in the literature.

Experimental Data and Setup
The first dataset is the well-known UC-Merced land-use dataset [28].It is the first public ground truth land-use scene image dataset that consists of 21 land-use classes and each class contains 100 images with a size of 256 ˆ256 pixels.The images were manually extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map.This is a challenging dataset due to a variety of spatial patterns in those 21 classes.Sample images of each land-use class are shown in Figure 7.To facilitate a fair comparison, the same experimental setting reported in [28] is followed.Five-fold cross-validation is performed, in which the dataset is randomly partitioned into five equal subsets.There are 20 images from each land-use class in a subset.Four subsets are used for training and the remaining subset for testing.The classification accuracy is the average over the five cross-validation evaluations.The second dataset used in our experiments is the 19-class satellite scene dataset [29].It consists of 19 classes of high-resolution satellite scenes.There are 50 images with sizes of 600 × 600 pixels for each class.The images are extracted from large satellite images on Google Earth.An example of each class is shown in Figure 8.The same experimental setup in [30] was used.Here, 30 images are randomly selected per class as training data and the remaining images as testing data.The experiment is repeated 10 times with different realizations of randomly selected training and testing images.Classification accuracy is averaged over the 10 trials.The second dataset used in our experiments is the 19-class satellite scene dataset [29].It consists of 19 classes of high-resolution satellite scenes.There are 50 images with sizes of 600 ˆ600 pixels for each class.The images are extracted from large satellite images on Google Earth.An example of each class is shown in Figure 8.The same experimental setup in [30] was used.Here, 30 images are randomly selected per class as training data and the remaining images as testing data.The experiment is repeated 10 times with different realizations of randomly selected training and testing images.Classification accuracy is averaged over the 10 trials.The second dataset used in our experiments is the 19-class satellite scene dataset [29].It consists of 19 classes of high-resolution satellite scenes.There are 50 images with sizes of 600 × 600 pixels for each class.The images are extracted from large satellite images on Google Earth.An example of each class is shown in Figure 8.The same experimental setup in [30] was used.Here, 30 images are randomly selected per class as training data and the remaining images as testing data.The experiment is repeated 10 times with different realizations of randomly selected training and testing images.Classification accuracy is averaged over the 10 trials.Note that the original images in these two datasets are color images; the images are converted from the RGB color space to the YCbCr color space, and the Y component (luminance) is used for scene classification.

Parameters Setting
The number of neighbors (m) in the CLBP operator has a direct impact on the dimensionality of the FV since patch-based CLBP features are used as local patch descriptors.A large value of m will increase the feature dimensionality and then increase the computational complexity.Based on the parameter tuning results in [17], m " 8 is empirically chosen for both the 21-class land-use dataset and the 19-class satellite scene dataset as it balances the classification performance and computational complexity.In addition, the parameter settings in [17] are adopted for the MS-CLBP descriptor.Specifically, 6 radii (i.e., r " r1 : 6s) are used for the MS-CLBP descriptor, resulting 6 parameters sets tpm " 8, r 1 " 1q, ..., pm " 8, r 6 " 6qu.
Then, the number of scales is studied for the first implementation of the MS-CLBP operator for generating multi-scale images and the number of Gaussians (K) in the GMM.Note that the original images in these two datasets are color images; the images are converted from the RGB color space to the YCbCr color space, and the Y component (luminance) is used for scene classification.

Parameters Setting
The number of neighbors ( m ) in the CLBP operator has a direct impact on the dimensionality of the FV since patch-based CLBP features are used as local patch descriptors.A large value of will increase the feature dimensionality and then increase the computational complexity.Based on the parameter tuning results in [17], 8 m = is empirically chosen for both the 21-class land-use dataset and the 19-class satellite scene dataset as it balances the classification performance and computational complexity.In addition, the parameter settings in [17] are adopted for the MS-CLBP descriptor.Specifically, 6 radii (i.e.,    In addition, to gain computational efficiency, principal component analysis (PCA) [31,32] is employed to reduce the dimensionality of FV features.The PCA projection matrix was calculated using the features of the training data, and the principal components that accounted for 95% of the total variation of the training features are considered in our experiments.Thus, the optimal number of Gaussians for the 21-class land-use dataset is 35 and the optimal multiple scales is 1{r1 : 4s simultaneously considering classification accuracy and computational complexity.Similarly, the optimal number of Gaussians for the 19-class satellite scene dataset is 20 and the optimal multiple scale is 1{r1 : 4s.
Since the proposed method extracts dense local patches, the size of the patch (B ˆB) is determined empirically.The classification accuracies with varying patch sizes are illustrated in Figure 11.It is obvious that B " 32 achieves the best classification performance for the 21-class land-use dataset.The size of the images in the 19-class dataset is 600 ˆ600 pixels, which is about twice the size of the images in the 21-class dataset.Therefore, the patch size is set a B " 64 for the 19-class dataset.
In addition, to gain computational efficiency, principal component analysis (PCA) [31,32] is employed to reduce the dimensionality of FV features.The PCA projection matrix was calculated using the features of the training data, and the principal components that accounted for 95% of the total variation of the training features are considered in our experiments.In addition, to gain computational efficiency, principal component analysis (PCA) [31,32] is employed to reduce the dimensionality of FV features.The PCA projection matrix was calculated using the features of the training data, and the principal components that accounted for 95% of the total variation of the training features are considered in our experiments.

FV Representation vs. BOVW Model
To verify the advantage of FV as compared to the BOVW model, the MS-CLBP+BOVW is applied to both the 21-class land-use dataset and the 19-class satellite scene dataset and the performance is compared with our approach.The same parameters are used for the MS-CLBP feature.In the BOVW model, 30,000 patches are randomly selected from all patches and K-means clustering is employed to generate 1024 visual words as a typical setting.The classification performance of the proposed method and MS-CLBP+BOVW is evaluated over each category of the two datasets as shown in Figures 12  and 13, respectively.As can be seen from Figure 12, the proposed method provides better performance than MS-CLBP+BOVW in almost all categories except two, medium density residential and parking lot, and two categories (agricultural and forest) have equal performance.In Figure 13, the proposed method achieves greater accuracy than all classes except beach and industrial for the 19-class satellite scene dataset.

FV Representation vs. BOVW Model
To verify the advantage of FV as compared to the BOVW model, the MS-CLBP+BOVW is applied to both the 21-class land-use dataset and the 19-class satellite scene dataset and the performance is compared with our approach.The same parameters are used for the MS-CLBP feature.In the BOVW model, 30,000 patches are randomly selected from all patches and K-means clustering is employed to generate 1024 visual words as a typical setting.The classification performance of the proposed method and MS-CLBP+BOVW is evaluated over each category of the two datasets as shown in Figures 12 and 13, respectively.As can be seen from Figure 12, the proposed method provides better performance than MS-CLBP+BOVW in almost all categories except two, medium density residential and parking lot, and two categories (agricultural and forest) have equal performance.In Figure 13, the proposed method achieves greater accuracy than all classes except beach and industrial for the 19-class satellite scene dataset.

Comparison to the State-of-the-Art Methods
In this section, the effectiveness of the proposed image representation method is evaluated by comparing its performance with previously reported performance in the literature.Specifically, the proposed method is compared with the MS-CLBP descriptor [17] applied to an entire remote sensing image to obtain a global feature representation.The comparison results are reported in Table 1.From the comparison results, the proposed method achieves superior classification

FV Representation vs. BOVW Model
To verify the advantage of FV as compared to the BOVW model, the MS-CLBP+BOVW is applied to both the 21-class land-use dataset and the 19-class satellite scene dataset and the performance is compared with our approach.The same parameters are used for the MS-CLBP feature.In the BOVW model, 30,000 patches are randomly selected from all patches and K-means clustering is employed to generate 1024 visual words as a typical setting.The classification performance of the proposed method and MS-CLBP+BOVW is evaluated over each category of the two datasets as shown in Figures 12 and 13, respectively.As can be seen from Figure 12, the proposed method provides better performance than MS-CLBP+BOVW in almost all categories except two, medium density residential and parking lot, and two categories (agricultural and forest) have equal performance.In Figure 13, the proposed method achieves greater accuracy than all classes except beach and industrial for the 19-class satellite scene dataset.

Comparison to the State-of-the-Art Methods
In this section, the effectiveness of the proposed image representation method is evaluated by comparing its performance with previously reported performance in the literature.Specifically, the proposed method is compared with the MS-CLBP descriptor [17] applied to an entire remote sensing image to obtain a global feature representation.The comparison results are reported in Table 1.From the comparison results, the proposed method achieves superior classification

Comparison to the State-of-the-Art Methods
In this section, the effectiveness of the proposed image representation method is evaluated by comparing its performance with previously reported performance in the literature.Specifically, the proposed method is compared with the MS-CLBP descriptor [17] applied to an entire remote sensing image to obtain a global feature representation.The comparison results are reported in Table 1.From the comparison results, the proposed method achieves superior classification performance over other existing methods, which demonstrates the effectiveness of the proposed image representation for remote sensing land-use scene classification.The improvement of the proposed method over the global representation developed in [17] is 2.4%.This improvement is mainly due to the proposed local feature representation framework which unifies the two implementations of the MS-CLBP descriptor.Moreover, the proposed approach is an approximately 4.7% improvement over the MS-CLBP + BOVW method, which verifies the advantage of the Fisher kernel representation as compared to the BOVW model.Figure 14 shows the confusion matrix of the proposed method for the 21-class land-use dataset.The diagonal elements of the matrix denote the mean class-specific classification accuracy (%).We find an interesting phenomenon from Figure 14 that diagonal elements for beach and forest are extremely large but diagonal elements for storage tank is relatively small.The reasons are that images of beach and forest present rich texture and structures information; within-class similarity for the beach and forest categories is high but relatively low for category of storage tank; and some images of storage tank are similar to other class such as buildings.
The comparison results for the 19-class satellite scene dataset are listed in Table 2.It indicates that the proposed method outperforms other existing methods and achieves the best performance.The proposed method provides about 7% improvement over the method in [31] which utilized a combination of multiple sets of features, indicating the superior discriminative power of the proposed feature representation.The confusion matrix of the proposed method for the 19-class satellite scene dataset is shown in Figure 15.From diagonal elements of the matrix, the classification accuracy for bridges is relatively small because some texture information in the images of bridges is similar to those in the images of ports.

Method
Accuracy (Mean ˘std) Bag of colors [25] 70.6 ˘1.5 Tree of c-shapes [25] 80.4 ˘1.8 Bag of SIFT [25] 85.5 ˘1.2 Multifeature concatenation [25] 90.8 ˘0.7 LTP-HF [23] 77.6 SIFT + LTP-HF + Color histogram [23] 93.6 MS-CLBP [1] 93.When compared with CNNs, it can be found that the classification accuracy of CNNs is close to that of our method.Even though the performance of some CNNs is better than the proposed method, they need a pre-training process with a large amount of external data.Thus our method is still competitive in terms of limited requirement for external data.
The comparison results for the 19-class satellite scene dataset are listed in Table 2.It indicates that the proposed method outperforms other existing methods and achieves the best performance.The proposed method provides about 7% improvement over the method in [31] which utilized a combination of multiple sets of features, indicating the superior discriminative power of the proposed feature representation.The confusion matrix of the proposed method for the 19-class satellite scene dataset is shown in Figure 15.From diagonal elements of the matrix, the classification accuracy for bridges is relatively small because some texture information in the images of bridges is similar to those in the images of ports.

Conclusions
In this paper, an effective image representation method for remote sensing image scene classification was introduced.The proposed representation method is based on multi-scale local binary patterns features and Fisher vectors.The MS-CLBP was applied to the partitioned dense regions of an image to extract a set of local patch descriptors, which characterize the detailed structure and texture information in high-resolution remote sensing images.The Fisher vector was

Conclusions
In this paper, an effective image representation method for remote sensing image scene classification was introduced.The proposed representation method is based on multi-scale local binary patterns features and Fisher vectors.The MS-CLBP was applied to the partitioned dense regions of an image to extract a set of local patch descriptors, which characterize the detailed structure and texture information in high-resolution remote sensing images.The Fisher vector was employed to encode the local descriptors into a high-dimensional gradient representation, which can enhance the discriminative power of feature representation.Experimental results on two land-use scene datasets demonstrated that the proposed image representation approach obtained superior performance as compared to the existing methods for scene classification, with an obvious improvement such as 3% for the 21-class land-use dataset compared with the state-of-the-art MS-CLBP and 1% for the 19-class satellite scene dataset.In future work, combining global and local feature representations for remote sensing image scene classification will be investigated.

Figure 1 .
Figure 1.(a) 3 × 3 sample block; (b) the local differences; (c) the sign component of CLBP; (d) the absolute value of local differences; (e) the magnitude component of CLBP.

Figure 2
Figure 2 presents an example of the CLBP_S and CLBP_M coded images corresponding to an input aerial scene (viaduct scene).It can be observed that CLBP_S and CLBP_M operators both can capture the spatial pattern and the contrast of local image texture, such as edges and corners.

Figure 1 .
Figure 1.(a) 3 ˆ3 sample block; (b) the local differences; (c) the sign component of CLBP; (d) the absolute value of local differences; (e) the magnitude component of CLBP.

Figure 2
Figure 2 presents an example of the CLBP_S and CLBP_M coded images corresponding to an input aerial scene (viaduct scene).It can be observed that CLBP_S and CLBP_M operators both can capture the spatial pattern and the contrast of local image texture, such as edges and corners.

Figure 1 .
Figure 1.(a) 3 × 3 sample block; (b) the local differences; (c) the sign component of CLBP; (d) the absolute value of local differences; (e) the magnitude component of CLBP.

Figure 2
Figure 2 presents an example of the CLBP_S and CLBP_M coded images corresponding to an input aerial scene (viaduct scene).It can be observed that CLBP_S and CLBP_M operators both can capture the spatial pattern and the contrast of local image texture, such as edges and corners.

Figure 3 .
Figure 3.An example of the first implementation of a 3-scale CLBP operator (

Figure 4 .
Figure 4.An example of the second implementation of a 3-scale CLBP operator (

Figure 3 . 3 r
Figure 3.An example of the first implementation of a 3-scale CLBP operator (

Figure 4 .
Figure 4.An example of the second implementation of a 3-scale CLBP operator (

1 2 ,
A histogram feature vector denoted by i h is formed by concatenating the two histograms.If patches are extracted from the multi-scale images, a feature matrix denoted by [ ] ,..., M = H h h h is generated to represent the original image.Each column of the matrix H is a histogram feature vector for a patch.The proposed patch-based CLBP feature extraction method is illustrated in Figure 5.
(i.e., patch-based CLBP features) of each image are obtained using the feature extraction method illustrated in Figure 5. Since parameter sets (i.e., ) are employed for the CLBP operator, each image yields feature matrices denoted by { } 2,..., ] T j N ∈ .For each CLBP parameter set, the corresponding feature matrices of the training data are used to estimate the GMM parameters via the Expectation-Maximization (EM) algorithm.Therefore, for q CLBP parameter sets, q
For the 21-class land-use dataset, 80 images are randomly selected per class for training and the remaining images for testing.For the 19-class satellite scene dataset, 30 images per class are randomly selected for training and the remaining images for testing.Different numbers of Gaussians are examined along with different choices of multiple scales including t1, 1{r1 : 2s, ..., 1{r1 : 6su.For instance, 1{r1 : 2s indicates that scale = 1 (original image) and scale = 1/2 (down-sampled image at half of the size of the original image) are used to generate two images with two scales.Figures 9 and 10 present the classification results with different numbers of Gaussians in the GMM and different numbers of scales for the two datasets, respectively.Remote Sens. 2016, 8, 483 10 of 17 number of scales is studied for the first implementation of the MS-CLBP operator for generating multi-scale images and the number of Gaussians ( K ) in the GMM.For the 21-class land-use dataset, 80 images are randomly selected per class for training and the remaining images for testing.For the 19-class satellite scene dataset, 30 images per class are randomly selected for training and the remaining images for testing.Different numbers of Gaussians are examined along with different choices of multiple scales including {1,1 [1: 2],...,1 [1: 6]} .For instance, 1 [1: 2] indicates that scale = 1 (original image) and scale = 1/2 (down-sampled image at half of the size of the original image) are used to generate two images with two scales.Figures 9 and 10 present the classification results with different numbers of Gaussians in the GMM and different numbers of scales for the two datasets, respectively.

Figure 9 .
Figure 9. Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed method for the 21-class land-use dataset.

Figure 9 .
Figure 9. Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed method for the 21-class land-use dataset.

Figure 10 .
Figure 10.Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed method for the 19-class satellite scene dataset.

Figure 10 .
Figure 10.Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed method for the 19-class satellite scene dataset.

Figure 10 .
Figure 10.Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed method for the 19-class satellite scene dataset.

Figure 12 .
Figure 12.Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 21-class land-use dataset.

Figure 13 .
Figure 13.Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 19-class satellite scene dataset.

Figure 12 .
Figure 12.Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 21-class land-use dataset.

Figure 12 .
Figure 12.Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 21-class land-use dataset.

Figure 13 .
Figure 13.Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 19-class satellite scene dataset.

Figure 13 .
Figure 13.Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 19-class satellite scene dataset.

Figure 14 .
Figure 14.Confusion matrix of proposed method for the 21-class land-use dataset.

Figure 14 .
Figure 14.Confusion matrix of proposed method for the 21-class land-use dataset.

2 Figure 15 .
Figure 15.Confusion matrix of proposed method for the 19-class satellite scene dataset.

Figure 15 .
Figure 15.Confusion matrix of proposed method for the 19-class satellite scene dataset.

Table 2 .
Comparison of classification accuracy (%) for the 19-class satellite scene dataset.

Table 2 .
Comparison of classification accuracy (%) for the 19-class satellite scene dataset.