A Color-Texture-Structure Descriptor for High-Resolution Satellite Image Classification

Abstract: Scene classification plays an important role in understanding high-resolution satellite (HRS) remotely sensed imagery. For remotely sensed scenes, both color information and texture information provide discriminative ability in classification tasks. In recent years, substantial performance gains in HRS image classification have been reported in the literature. One branch of research combines multiple complementary features based on various aspects such as texture, color and structure. Two methods are commonly used to combine these features: early fusion and late fusion. In this paper, we propose combining the two methods under a tree of regions and present a new descriptor, which we call the CTS descriptor, to encode color, texture and structure features using a hierarchical structure, the Color Binary Partition Tree (CBPT). Specifically, we first build the hierarchical representation of HRS imagery using the CBPT. Then we quantize the texture and color features of dense regions. Next, we analyze and extract the co-occurrence patterns of regions based on the hierarchical structure. Finally, we encode local descriptors to obtain the final CTS descriptor and test its discriminative capability using object categorization and scene classification with HRS images. The proposed descriptor contains the spectral, textural and structural information of the HRS imagery and is also robust to changes in illuminant color, scale, orientation and contrast. The experimental results demonstrate that the proposed CTS descriptor achieves competitive classification results compared with state-of-the-art algorithms.


Introduction
High-resolution satellite (HRS) imagery is increasingly being used to support accurate earth observations. However, the efficient combination of fine spectral, textural and structural information toward achieving reliable and consistent HRS image classification remains problematic [1][2][3][4][5]. This article addresses this challenge by presenting a new descriptor for object categorization and scene classification using HRS images.

Motivation and Objective
HRS images, compared to ordinary low- and medium-resolution images, have some special properties; e.g., (1) the geometry of ground objects is more distinct; (2) the spatial layout is clearer; (3) the texture information is relatively finer; and (4) the entire image is a collection of multi-scale objects. The continuous improvement of spatial resolution poses substantial challenges to traditional pixel-based spectral and texture analysis methods. The variety observed in objects' spectra and the multi-scale property differentiate HRS image classification from conventional natural image classification. In particular, this paper focuses on object categorization and scene classification using HRS images by analyzing the following two aspects:
(1) Multi-resolution representation of the HRS images: An HRS image is a unification of multi-scale objects, where there are substantial large-scale objects at coarse levels as well as small objects at fine levels. In addition, given the multi-scale cognitive mechanism underlying the human visual system, which operates on the level of the object to the environment and then to the background, analysis on a single scale is insufficient for extracting all semantic objects. To represent HRS images on multiple scales, three main methods are utilized: image pyramid [6], wavelet transform [7] and hierarchical image partitions [8]. However, how to consider the intrinsic properties of local objects in multi-scale image representation is a key problem worth studying.
(2) The efficient combination of various features: Color, texture and structure are reported to be discriminative and widely used features for HRS image classification [1][2][3][4][5]. An efficient combination of the three cues can help us better understand HRS images. Conventional methods using one or two features have achieved good results in image classification and retrieval, e.g., in Bag of SIFT [1] and Bag of colors [9]. However, color, texture and structure information all contribute to the understanding of the images, and image descriptors defined in different feature spaces usually help improve the performance of analyzing objects and scenes in HRS images. Thus, how to efficiently combine different features represents another key problem.

Related Works
Given these two key problems in HRS image interpretation, it is important to review the literature on object-based image analysis, hierarchical image representation and multiple-cue fusion methods.
(1) Object-based feature extraction methods for HRS images: The semantic gap is more apparent in HRS imagery, and surface objects consist of substantially richer spectral, textural and structural information. Object-based feature extraction methods enable the clustering of several homogeneous pixels and the analysis of both local and global properties; moreover, the successful development of feature extraction technologies for HRS satellite imagery has greatly increased its usefulness in many remote sensing applications [10][11][12][13][14][15][16][17][18]. Blaschke et al. [10] discussed several limitations of pixel-based methods in analyzing high-resolution images and crystallized core concepts of Geographic Object Based Image Analysis. Huang and Zhang proposed an adaptive mean-shift analysis framework for object extraction and classification applied to hyperspectral imagery over urban areas, therein demonstrating the superiority of object-based methods [11]. Mallinis and Koutsias presented a multi-scale object-based analysis method for classifying Quickbird images. The adoption of objects instead of pixels provided much more information and challenges for classification [12]. Trias-Sanz et al. [14] investigated the combination of color and texture factors for segmenting high-resolution images into semantic regions, therein illustrating different transformed color spaces and texture features of object-based methods. Re-occurring compositions of visual primitives that indicate the relationships between different objects can be found in HRS images [19]. In the framework of object-based image analysis, the focus of attention is object semantics, the multi-scale property and the relationships between different objects.
(2) Hierarchical image representation for HRS images: Because an HRS image is a unification of multi-scale objects, there are substantial large-scale objects at coarse levels, such as water, forests, farmland and urban areas, as well as small targets at fine levels, e.g., buildings, cars and trees. In addition, a satellite image at different resolutions (from low to medium and subsequently to high spatial resolutions) will present different objects. Therefore, it is very important to consider the object differences at different scales. Several studies have utilized Gaussian pyramid image decomposition to build a hierarchical image representation [6,20]. In [6], Binaghi et al. analyzed a high-resolution scene through a set of concentric windows, and a Gaussian pyramidal resampling approach was used to reduce the computational burden. In [20], Yang and Newsam proposed a spatial pyramid co-occurrence to characterize the photometric and geometric aspects of an image. The pyramid captured both the absolute and relative spatial arrangements of objects (visual words). The obvious limitations of these approaches are the fixed regular shape and the choice of the analysis window size [21]. Meanwhile, some researchers employed wavelet-based methods to address the multi-scale property. Baraldi and Bruzzone used an almost complete (near-orthogonal) basis for the Gabor wavelet transform of images at selected spatial frequencies, which appeared to be superior to the dyadic multi-scale Gaussian pyramid image decomposition [7]. In [22], an object's contents were represented by the object's wavelet coefficients, the multi-scale property was reflected by the coefficients in different bands, and finally, a tree structural representation was built. However, wavelet decomposition is a decimation of the original image obtained through low-pass filter convolution, and it lacks consideration of the relationships between objects. By fully considering the intrinsic properties of objects, some studies have relied on hierarchical segmentation and have produced hierarchical image partitions [8,23]. These methods have addressed the multi-scale properties of objects [24]; however, they demonstrate few relationships between objects at different scales. Luo et al. proposed to use a topographic representation of an image to generate objects, therein considering both the spatial and structural properties [25]. However, the topographic representation is typically built on the gray-level image, which rarely concerns color difference. In [26][27][28][29][30], various types of images, e.g., natural images, hyperspectral images and PolSAR images, were represented by a hierarchical structure, namely, the Binary Partition Tree (BPT), which was constructed based on particular region models and merge criteria. BPT can represent multi-scale objects from fine to coarse levels. In addition, the topological relationships between regions are translation invariant because the tree encodes the relationships between regions. Therefore, we can fully consider the multi-scale property, the spatial structure relationships and the intrinsic properties of objects using the BPT representation.
(3) Multiple-cue fusion methods: Color features describe the reflective spectral information of images and are usually encoded with statistical measures, e.g., color distributions [31][32][33][34]. Texture features reflect a specific, spatially repetitive pattern of surfaces by repeating a particular visual pattern in different spatial positions [35][36][37], e.g., coarseness, contrast and regularity. For HRS images, structure features contain the macroscopic relationships between objects [38][39][40], such as adjacent relations and inclusion relations. Because structure features exist between different objects, the discussion is primarily concerned with the fusion of color and texture. There are two main fusing methods: early fusion and late fusion. Methods that combine cues prior to feature extraction are called early fusion methods [32,41,42]. Methods wherein color and texture features are first separately extracted and then combined at the classifier stage are called late fusion methods [43][44][45]. In [46], the authors explained the properties of the two fusion methods and concluded that classes that exhibit color-shape dependency are better represented by early fusion, whereas classes that exhibit color-shape independency are better represented by late fusion. In HRS images, classes have both color-shape dependency and independency. For example, a dark area can be water, shadow or asphalted road; in contrast, a dark area near a building with a similar contour is most likely to be a shadow. Consequently, both early and late fusion methods can be used to classify an HRS image.
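To make the distinction between the two fusion strategies concrete, the following minimal sketch contrasts them on quantized per-pixel labels. It assumes that each pixel already carries a color word and a texture word; the function names, vocabulary sizes and toy regions are hypothetical and are not taken from the cited methods. Early fusion is illustrated by a joint histogram over (color, texture) pairs, and late fusion by concatenating two separately computed histograms.

```python
def early_fusion(color_ids, texture_ids, n_colors, n_textures):
    """Joint (color, texture) histogram, L1-normalized: cues are combined first."""
    hist = [0.0] * (n_colors * n_textures)
    for c, t in zip(color_ids, texture_ids):
        hist[c * n_textures + t] += 1.0
    total = sum(hist)
    return [h / total for h in hist]


def late_fusion(color_ids, texture_ids, n_colors, n_textures):
    """Separate color and texture histograms, concatenated afterwards."""
    color_hist = [0.0] * n_colors
    texture_hist = [0.0] * n_textures
    for c in color_ids:
        color_hist[c] += 1.0
    for t in texture_ids:
        texture_hist[t] += 1.0
    n = len(color_ids)
    return [h / n for h in color_hist] + [h / n for h in texture_hist]


# Two toy regions with identical marginal statistics but different pairings
# (color word 0 = dark, 1 = bright; texture word 0 = smooth, 1 = rough).
region_a = ([0, 0, 1, 1], [0, 0, 1, 1])  # dark-smooth and bright-rough pixels
region_b = ([0, 0, 1, 1], [1, 1, 0, 0])  # dark-rough and bright-smooth pixels

print(early_fusion(*region_a, 2, 2), early_fusion(*region_b, 2, 2))
print(late_fusion(*region_a, 2, 2), late_fusion(*region_b, 2, 2))
```

In this toy case the late-fusion vectors of the two regions coincide, whereas the early-fusion vectors differ, which mirrors the dependency argument of [46]; the price of early fusion is a vector whose length grows with the product of the two vocabulary sizes.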
Many local features have been developed to describe color, texture and structure properties, such as Color Histogram, Gabor, SIFT and HOG (Histograms of Oriented Gradients). To further improve classification accuracy, Bag of Words (BOW) [47] and Fisher vector (FV) coding [48] have been proposed to achieve more discriminative feature representation. BOW strategies have achieved great success in computer vision. Under the BOW framework, there are three representative coding and pooling methods: Spatial Pyramid Matching (SPM) [38], Spatial Pyramid Matching using Sparse Coding (ScSPM) [49] and Locality-constrained Linear Coding (LLC) [50]. The traditional SPM approach uses vector quantization and multi-scale spatial average pooling and thus requires nonlinear classifiers to achieve good image classification performance. ScSPM, however, uses sparse coding and multi-scale spatial max pooling and thus can achieve good performance with linear classifiers. LLC utilizes locality constraints to project each descriptor into its local coordinate system, and the projected coordinates are integrated via max pooling to generate the final representation. With a linear classifier, it performs remarkably better than traditional nonlinear SPM. An alternative to BOW is FV coding, which combines the strengths of generative and discriminative approaches for image classification [48,51]. The main idea of FV coding is to characterize the local features with a gradient vector derived from the probability density function. FV coding uses a Gaussian Mixture Model (GMM) to approximate the distribution of low-level features. Compared to BOW, FV is not limited to counting the occurrences of each visual word but also encodes additional information about the distribution of the local descriptors. The dimension of FV is much larger for the same dictionary size. Hence, there is no need to project the final descriptors into higher-dimensional spaces with costly kernels.
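The difference in dimensionality between BOW and FV for the same dictionary size can be illustrated with a minimal sketch. It assumes a diagonal-covariance GMM whose parameters are already known, and it uses the commonly cited closed-form gradients with respect to the component means and standard deviations, without any subsequent normalization; all function names and toy values are hypothetical, and this is not the implementation used in the paper.

```python
import math


def posteriors(x, weights, means, sigmas):
    """Soft assignment of descriptor x to each diagonal Gaussian component."""
    logs = []
    for w, m, s in zip(weights, means, sigmas):
        log_pdf = sum(-0.5 * math.log(2.0 * math.pi * si * si)
                      - (xi - mi) ** 2 / (2.0 * si * si)
                      for xi, mi, si in zip(x, m, s))
        logs.append(math.log(w) + log_pdf)
    peak = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def bow_histogram(descs, means):
    """BOW: hard-assign each descriptor to its nearest word and count (K values)."""
    hist = [0.0] * len(means)
    for x in descs:
        dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
        hist[dists.index(min(dists))] += 1.0 / len(descs)
    return hist


def fisher_vector(descs, weights, means, sigmas):
    """FV: gradients w.r.t. means and standard deviations (2 * K * D values)."""
    n, k, d = len(descs), len(weights), len(means[0])
    g_mu = [[0.0] * d for _ in range(k)]
    g_sigma = [[0.0] * d for _ in range(k)]
    for x in descs:
        gamma = posteriors(x, weights, means, sigmas)
        for j in range(k):
            for i in range(d):
                z = (x[i] - means[j][i]) / sigmas[j][i]
                g_mu[j][i] += gamma[j] * z
                g_sigma[j][i] += gamma[j] * (z * z - 1.0)
    fv = []
    for j in range(k):
        fv += [v / (n * math.sqrt(weights[j])) for v in g_mu[j]]
    for j in range(k):
        fv += [v / (n * math.sqrt(2.0 * weights[j])) for v in g_sigma[j]]
    return fv


# Toy dictionary with K = 2 words in D = 2 dimensions (hypothetical values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
descs = [[0.1, -0.2], [3.9, 4.3], [0.5, 0.2]]
print(len(bow_histogram(descs, means)), len(fisher_vector(descs, weights, means, sigmas)))
```

With K words and D-dimensional descriptors, the BOW vector has K entries while this FV has 2KD entries, which is the reason a linear classifier is usually sufficient for FV.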

Contribution of this Work
Because BPT is a hierarchical representation that fully considers multi-scale characteristics and topological relationships between regions, we propose using BPT to represent HRS images. Based on our earlier work [36] addressing texture analysis, we further implement an efficient combination of color, texture and structure features for object categorization and scene classification. In this paper, we propose a new color-texture-structure descriptor, referred to as the CTS descriptor, for HRS image classification based on the color binary partition tree (CBPT). The CBPT construction fully considers the spatial and color properties of HRS images, thereby producing a compact hierarchical structure. Then, we extract rich color and texture features of local regions. Simultaneously, we analyze the CBPT structure and design co-occurrence patterns to describe the relationships of regions. Next, we encode these features by FV coding to build the CTS descriptor. Finally, we test the CTS descriptor as applied to HRS image classification. Figure 1 illustrates the flowchart of the HRS image classification process using the CTS descriptor.
Our main contribution is the description of color and texture information based on the BPT structure. By fully considering the characteristics of the CBPT, we not only build region-based hierarchical structures for HRS images but also establish the topological relationships between regions in terms of space and scale. We present an efficient combination of color and texture via the CBPT and analyze the co-occurrence patterns of objects from the connective hierarchical structure, which can effectively address the multi-scale property, the topological relationships and the intrinsic properties of HRS images. Using the CBPT representation and the combination of color, texture and structure information, we finally achieve the combination of early and late fusion. To our knowledge, this is the first time that color, texture and structure information have been analyzed based on BPT for HRS image interpretation.
The remainder of this paper is organized as follows. Section 2 first analyzes color features and the construction of the CBPT. Texture and color feature analysis of the CBPT is presented in detail in Section 3. Moreover, we briefly introduce the pattern design and coding method. Next, experimental results are given in Section 4, and capabilities and limitations are discussed in Section 5. Finally, the conclusions are presented in Section 6.

Color Description of HRS Image
Color description is important to the construction of the CBPT and to the analysis of a region. Generally, the RGB values of HRS images are sensitive to photometric variations. Therefore, we have to employ color features that are invariant to undesired variations, such as shadows, specularities and illuminant color changes. Below, we briefly recap several color features applied to HRS images.
Color moment [31]: According to probability theory, a probability distribution can be characterized by its moments. Thus, if the color distribution of a region is treated as a probability distribution, the color moments, consisting of the mean, variance and skewness, can be used to generate robust and discriminative color distribution features. An important characteristic of the color moment is that the color distribution is associated with the color space.
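As a minimal sketch of this feature, the three moments can be computed per channel and concatenated into a nine-dimensional vector. The function name is hypothetical, and the skewness is taken here as the standardized third central moment, which is an assumption; other definitions (e.g., the cube root of the third central moment) are also in use.

```python
def color_moments(pixels):
    """Mean, variance and skewness of each channel of a region.

    pixels: list of 3-tuples (one value per color channel).
    Returns [mean_0, var_0, skew_0, mean_1, var_1, skew_1, mean_2, var_2, skew_2].
    """
    n = len(pixels)
    features = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Assumption: standardized skewness; defined as 0 for a constant channel.
        skewness = third / variance ** 1.5 if variance > 0 else 0.0
        features.extend([mean, variance, skewness])
    return features


# Toy region of three pixels (hypothetical values in [0, 1]).
region = [(0.0, 0.5, 1.0), (0.2, 0.5, 0.0), (0.4, 0.5, 0.0)]
print(color_moments(region))
```

Because the moments depend on the channel values, the same region yields different descriptors in different color spaces, which is the dependence on the color space noted above.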
Hue [34]: Image regions are represented by a histogram over hue, computed from the corresponding RGB values of each pixel according to
$$\mathrm{hue} = \arctan\left(\frac{\sqrt{3}\,(R - G)}{R + G - 2B}\right) \tag{1}$$
The hue description is computed from the RGB color space and achieves photometric invariance.
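A minimal sketch of such a hue histogram follows. It assumes the hue angle is arctan(sqrt(3)(R - G) / (R + G - 2B)), evaluated with a two-argument arctangent so that the full circle is covered; the function names, the number of bins and the unweighted counting are hypothetical choices and not necessarily those of [34].

```python
import math


def hue_angle(r, g, b):
    """Hue angle in (-pi, pi] from RGB values (assumed opponent-ratio form)."""
    return math.atan2(math.sqrt(3.0) * (r - g), r + g - 2.0 * b)


def hue_histogram(pixels, n_bins=36):
    """L1-normalized histogram of hue angles over a region."""
    hist = [0.0] * n_bins
    counted = 0
    for r, g, b in pixels:
        if r == g == b:
            continue  # hue is undefined for achromatic pixels
        angle = hue_angle(r, g, b)
        index = min(int((angle + math.pi) / (2.0 * math.pi) * n_bins), n_bins - 1)
        hist[index] += 1.0
        counted += 1
    return [h / counted for h in hist] if counted else hist


# The angle is unchanged when the pixel is scaled or shifted by a constant.
print(hue_angle(0.2, 0.4, 0.1), hue_angle(0.4, 0.8, 0.2), hue_angle(0.5, 0.7, 0.4))
```

Both the numerator and the denominator cancel a constant added to all three channels, and their ratio cancels a common scale factor, which gives an intuition for the photometric invariance of the hue.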
Opponent [34]: For region-based analysis, the opponent descriptor is a histogram over the opponent angle:
$$\mathrm{ang}^{O}_{x} = \arctan\left(\frac{O1_x}{O2_x}\right) \tag{2}$$
where $O1_x$ and $O2_x$ are the spatial derivatives in the chromatic opponent channels, with $O1_x = \frac{1}{\sqrt{2}}(R_x - G_x)$ and $O2_x = \frac{1}{\sqrt{6}}(R_x + G_x - 2B_x)$, in which the subscript $x$ indicates spatial differentiation. The opponent angle is invariant with respect to specularities and diffuse lighting.
Color names (CN) [33]: Color names are linguistic color labels that are based on the assignment of colors in the real world. The English-language color terms include eleven basic terms: black, blue, brown, gray, green, orange, pink, purple, red, white and yellow. First, CN learns the mapping between the RGB color space and the color attributes. Then, a new RGB area is mapped to the color attribute space. The color names of region $R$ are defined as follows:
$$CN_R = \{p_R(cn_1), p_R(cn_2), \cdots, p_R(cn_{11})\} \tag{3}$$
with
$$p_R(cn_i) = \frac{1}{N} \sum_{x \in R} p(cn_i \mid f(x)) \tag{4}$$
where $cn_i$ $(i = 1, \cdots, 11)$ is the i-th color name, $N$ is the total number of pixels in region $R$, and $p(cn_i \mid f(x))$ is the probability of a color name given a pixel $x$, which is calculated by the mapping function. CN obtains better photometric invariance because different shades of a color are mapped to the same color name.
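The region-level color name descriptor is an average of per-pixel color name probabilities, which the following minimal sketch illustrates. The learned mapping of [33] is not reproduced here: the prototype colors and the nearest-prototype function `toy_color_name_probs` are hypothetical stand-ins used only to make the example runnable.

```python
# Hypothetical RGB prototypes in [0, 1]; a real CN model learns this mapping from data.
PROTOTYPES = {
    "black": (0.0, 0.0, 0.0), "blue": (0.0, 0.0, 1.0), "brown": (0.6, 0.4, 0.2),
    "gray": (0.5, 0.5, 0.5), "green": (0.0, 1.0, 0.0), "orange": (1.0, 0.5, 0.0),
    "pink": (1.0, 0.75, 0.8), "purple": (0.5, 0.0, 0.5), "red": (1.0, 0.0, 0.0),
    "white": (1.0, 1.0, 1.0), "yellow": (1.0, 1.0, 0.0),
}
COLOR_NAMES = list(PROTOTYPES)  # eleven basic color terms, in insertion order


def toy_color_name_probs(rgb):
    """Stand-in for p(cn_i | f(x)): probability 1 for the nearest prototype."""
    dists = [sum((a - b) ** 2 for a, b in zip(rgb, PROTOTYPES[name]))
             for name in COLOR_NAMES]
    best = dists.index(min(dists))
    return [1.0 if i == best else 0.0 for i in range(len(COLOR_NAMES))]


def color_name_descriptor(region_pixels):
    """Average the per-pixel color name probabilities over the region."""
    n = len(region_pixels)
    descriptor = [0.0] * len(COLOR_NAMES)
    for rgb in region_pixels:
        for i, prob in enumerate(toy_color_name_probs(rgb)):
            descriptor[i] += prob / n
    return descriptor


# Toy region: two dark pixels, one near-white pixel and one reddish pixel.
region = [(0.05, 0.05, 0.05), (0.1, 0.0, 0.0), (0.95, 0.95, 0.9), (0.9, 0.05, 0.05)]
print(dict(zip(COLOR_NAMES, color_name_descriptor(region))))
```

Two different dark shades fall into the same "black" bin in this toy mapping, which reflects how different shades of a color are mapped to the same color name.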

Color Binary Partition Tree (CBPT) Construction
The CBPT is a hierarchical structure in which every node and every level contains semantic information. The leaf nodes represent the original pixels of an image, the root node represents the entire image, and the nodes between the leaf nodes and the root represent parts or regions of the image. Moreover, a node resulting from the merger of two lower nodes is called the parent of the two nodes, and the two nodes are siblings of each other. An important property of the CBPT is that the tree can be reconstructed from any node of the structure on the condition that we know the parent, sibling and children of every node. There are two standard approaches to building the CBPT, namely, the merging approach and the splitting approach, which are opposite in nature. The merging method, which is a bottom-up approach, consists of merging the two regions that are most similar in the region model and nearest in location. The splitting method, which is a top-down approach, divides one region into two complete parts that are most dissimilar. However, it is difficult to find a splitting criterion because there are numerous ways to split a region and a brute-force search is computationally expensive.
Because the complexity of merging is substantially lower than the complexity of splitting, we choose a bottom-up method to construct the CBPT. We briefly illustrate the construction with 4 nodes. From the locations of A, B, C and D, we can obtain 4 pairs of adjacent nodes: (A, B), (A, D), (B, C) and (C, D). These adjacent pairs are pushed into a priority queue after their similarity is measured. The top of the queue is (A, B); therefore, this pair is removed from the queue and merged to form E. When updating the adjacency list, E is a neighbor of C and D; thus, (E, C) and (E, D) are pushed into the queue. In the ordered queue, we find that (C, D) is most similar; thus, this pair is popped out to form F. At this point, A, B, C and D have all been used, and the last pair to merge is (E, F). As a result, G represents the entire image. The schematic map is illustrated in Figure 2.
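The four-node example can be traced with a minimal sketch of bottom-up merging driven by a priority queue. The scalar region values, the absolute-difference dissimilarity and the function name are hypothetical simplifications chosen so that the merge order matches the example; pairs that refer to an already merged region are simply skipped when they reach the top of the queue.

```python
import heapq
from itertools import count


def build_tree(leaves, edges, new_names):
    """Greedy bottom-up merging of adjacent regions.

    leaves: dict name -> (mean value, size); edges: list of adjacent pairs;
    new_names: names given to the parents, in creation order.
    Returns the list of merges as (child, child, parent).
    """
    model = dict(leaves)
    neighbors = {name: set() for name in leaves}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    heap, tie = [], count()

    def push(a, b):
        # Hypothetical dissimilarity: absolute difference of the region means.
        heapq.heappush(heap, (abs(model[a][0] - model[b][0]), next(tie), a, b))

    for a, b in edges:
        push(a, b)

    alive, names, merges = set(leaves), iter(new_names), []
    while heap:
        _, _, a, b = heapq.heappop(heap)
        if a not in alive or b not in alive:
            continue  # stale pair: one of the regions has already been merged
        parent = next(names)
        (mean_a, n_a), (mean_b, n_b) = model[a], model[b]
        model[parent] = ((mean_a * n_a + mean_b * n_b) / (n_a + n_b), n_a + n_b)
        alive -= {a, b}
        alive.add(parent)
        neighbors[parent] = (neighbors[a] | neighbors[b]) - {a, b}
        for other in neighbors[parent]:
            neighbors[other] -= {a, b}
            neighbors[other].add(parent)
            push(parent, other)
        merges.append((a, b, parent))
    return merges


leaves = {"A": (0.10, 1), "B": (0.12, 1), "C": (0.80, 1), "D": (0.75, 1)}
edges = [("A", "B"), ("A", "D"), ("B", "C"), ("C", "D")]
merges = build_tree(leaves, edges, "EFG")
print(merges)
```

With these toy values, (A, B) is merged first into E, (C, D) is merged next into F, and E and F are finally merged into the root G, as in the example.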
From the example above, we find that the priority queue is very important to the construction of the CBPT. However, the measurement of the similarity of two spatially neighboring regions represents the most important problem. This question calls upon two important concepts: the region model and the similarity measurement. As mentioned in Section 2.1, the similarity between two regions can be quantified either in a three-channel color space or through the use of color features. High-dimensional color features, such as Hue [34] and CN [33], lead to high computational complexity, which reduces the efficiency of CBPT construction. Therefore, a three-channel color space would provide higher computational performance, despite the distance precision possibly not being as accurate as that of the high-dimensional features. Khan and van de Weijer discussed the distance precision for approximately 11 color features [52]. The results showed that high-level color features, e.g., CN [33] and Opp [34], obtain the highest distance precision. However, their high-dimensional property results in high computational complexity in building CBPTs. Among the three-channel color features, HSV provides the highest distance precision, being comparable to that of high-dimensional color features [52]. Pursuing a compromise between computational complexity and accuracy, we finally choose HSV to build the region models. For consistency, pixels are treated as regions. We denote the model of region $R$ by $M_R$, which is defined in the HSV space as
$$M_R = \frac{1}{N_R} \sum_{p \in R} I(p) \tag{5}$$
where $N_R$ is the number of pixels in region $R$ and $I(p) = \{H, S, V\}$. The model of a region is a three-dimensional vector that contains the mean of each of the three channels over all pixels contained in the region. This model typically describes the average intensity of every channel. Thus, we calculate the similarity based on the weighted difference of all channels. The similarity measure is calculated for each pair of neighboring regions, and the merging criterion is used to choose the pair of neighboring regions that are most similar. The weighted Euclidean distance (WED) is used to measure the similarity [17,18]. In the following, it is assumed that two neighboring regions, denoted by $R_1$ and $R_2$, with region models $M_{R_1}$ and $M_{R_2}$ and region sizes of $N_{R_1}$ and $N_{R_2}$ pixels, respectively, are evaluated based on the dissimilarity measure $d$, which is denoted by $d(R_1, R_2)$. Assuming that the region $R_1 \cup R_2$ represents the merged area of $R_1$ and $R_2$, its model is denoted by $M_{R_1 \cup R_2}$. The WED between region models is defined as
$$d(R_1, R_2) = N_{R_1} \left\| M_{R_1} - M_{R_1 \cup R_2} \right\|_2 + N_{R_2} \left\| M_{R_2} - M_{R_1 \cup R_2} \right\|_2 \tag{6}$$
As can be inferred from Equation (5), region models are size-independent measures. To produce uniform large regions, the WED utilizes a distance weighted by the region sizes and compares the models of the original regions with the model of the merged region. The model of the merged region is obtained by approximation, which provides a compromise between efficiency and accuracy. To further enhance the efficiency of CBPT construction, a priority queue is established using all pairs of neighboring regions. When a new pair of regions enters the queue, its position is determined by the distance between the two regions (WED). The top of the queue, which consists of the pair of neighboring regions that are most similar, is popped out for merging. Note that one region has many neighbors. Therefore, if a region has been used to generate a new region, all pairs of regions that contain this region will no longer be used.
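The region model and the size-weighted distance can be sketched as follows. The sketch assumes that the WED takes the commonly used form d(R1, R2) = N1 ||M1 - M|| + N2 ||M2 - M||, where M is the model of the merged region, and it computes M as the exact size-weighted mean, which is an assumption that replaces the approximation used in the paper. Function names and toy values are hypothetical.

```python
import math


def region_model(pixels_hsv):
    """Mean (H, S, V) vector of a region given as a list of HSV triples."""
    n = len(pixels_hsv)
    # Note: the hue channel is averaged like the others, ignoring its circularity.
    return tuple(sum(p[c] for p in pixels_hsv) / n for c in range(3))


def merged_model(m1, n1, m2, n2):
    """Model of the union of two regions (assumption: exact size-weighted mean)."""
    return tuple((a * n1 + b * n2) / (n1 + n2) for a, b in zip(m1, m2))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def wed(m1, n1, m2, n2):
    """Size-weighted distance of both region models to the merged model."""
    m12 = merged_model(m1, n1, m2, n2)
    return n1 * euclidean(m1, m12) + n2 * euclidean(m2, m12)


m_a = region_model([(0.0, 0.0, 0.0)])
m_b = region_model([(0.3, 0.4, 0.0)])
print(wed(m_a, 1, m_b, 1))    # two single-pixel regions
print(wed(m_a, 10, m_b, 10))  # same models, larger regions
```

Under these assumptions the distance grows with the region sizes for a fixed difference between the models, so merging two large dissimilar regions is postponed relative to merging small ones.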
A segmentation experiment based on color homogeneity is conducted to show the results of the CBPT construction. Segmentation is a process used to prune the tree, resulting in a complete regional expression.
Figure 3 shows the multi-scale segmentation results of an HRS image via the CBPT. The image is represented by fine texture and numerous tiny objects at the fine level and by sparse texture and large homogeneous areas at the coarse level. We can observe that the regions of interest, e.g., the aircraft, are represented by clear contours. In contrast, the background of the airport is segmented into dense small regions at fine levels and into large homogeneous areas at the coarse level.

Shape-Based Invariant Texture Analysis (SITA)
Common geometric properties can be found in the same category when performing scene classification. Therefore, the textures of semantic regions are similar. The hierarchical structure of the BPT provides multi-scale region representations; thus, the modeling of texture is converted to describing the nodes (regions) of the BPT. The texture description first relies on classical shape moments and then uses the hierarchical structure of the BPT [36].
The shape moments of a region s are defined as
$$u_{pq} = \sum_{I \in s} (x_I - \bar{x})^p (y_I - \bar{y})^q \, dx_I \, dy_I \qquad (9)$$
where $(\bar{x}, \bar{y})$ is the centroid of region s. Based on the shape moments, the employed texture attributes are listed as follows. $\lambda_1$ and $\lambda_2$ denote the two eigenvalues of the normalized inertia matrix of s, with $\lambda_1 \geq \lambda_2$; a is the region's area; p is the region's perimeter; I(s) are the pixels in region s; $s^r$, $r \in [1, \cdots, M]$, is the r-th ancestor of region s in the BPT; and $a_{min}$, $a_{max}$ are two thresholds on the shape area.
(1) Elongation, which defines the aspect ratio of the region.
(2) Orientation, which defines the angle between the major and minor axes.
(3) Rectangularity, which defines to what extent a region is rectangular, where w and l are the width and height, respectively, of the minimum bounding rectangle.
(4) Circle-compactness, defined analogously to rectangularity with respect to a circle.
(5) Ellipse-compactness, defined analogously to rectangularity with respect to an ellipse.
(6) Scale ratio, which defines the relationship between the current region s and its former r ancestors.
(7) Normalized area:
$$\theta = \frac{\ln a - \ln a_{min}}{\ln a_{max} - \ln a_{min}} \qquad (16)$$
In summary, the 7 above-mentioned types of geometry features describe the texture information of regions from different aspects. Therefore, these features are concatenated to improve the discriminative ability of the final descriptor.
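As a rough illustration of the quantities above, the following sketch computes discrete central moments of a region, the eigenvalues of its inertia matrix, and the normalized area of Equation (16). It is only a sketch under stated assumptions: the region is given as a list of (x, y) pixel coordinates, each pixel has unit area, and the inertia matrix is normalized by the squared area; the function names are hypothetical and not taken from the paper.

```python
import math

def central_moment(pixels, p, q):
    """Discrete central moment u_pq of a region given as (x, y) pixel coordinates."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n  # centroid x
    cy = sum(y for _, y in pixels) / n  # centroid y
    return sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)

def inertia_eigenvalues(pixels):
    """Eigenvalues (lam1 >= lam2) of the 2x2 inertia matrix.

    Assumption: the second-order moments are normalized by the squared area u_00**2.
    """
    u00 = central_moment(pixels, 0, 0)
    u20 = central_moment(pixels, 2, 0) / u00 ** 2
    u02 = central_moment(pixels, 0, 2) / u00 ** 2
    u11 = central_moment(pixels, 1, 1) / u00 ** 2
    half_trace = (u20 + u02) / 2
    root = math.sqrt(((u20 - u02) / 2) ** 2 + u11 ** 2)
    return half_trace + root, half_trace - root

def normalized_area(a, a_min, a_max):
    """Normalized area of Equation (16): log-scaled position of a between the thresholds."""
    return (math.log(a) - math.log(a_min)) / (math.log(a_max) - math.log(a_min))
```

For a 4 × 2 pixel rectangle, for example, the two eigenvalues differ by a factor of five, which reflects the elongated shape of the region.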

Color Features
SITA describes the texture attributes of regions in the CBPT; however, the color information has not yet been exploited. In Section 2, we introduced several color spaces and color features, and we used HSV to model the regions when building the CBPT. Using this color region model, we investigate color moments and color names. The distribution of average values over small regions, together with the variance and skewness of large regions, helps discriminate between scenes. The experiments in [34] suggested that color names provide the best performance in object detection; however, color names were designed for natural images and are therefore less suitable for HRS images. Therefore, we use color moments to describe the spectral information of objects.
When each channel is described separately, the color distribution of a region can be treated as a probability distribution, which we characterize using classical color moments computed per channel: (1) the normalized average; (2) the variance; and (3) the skewness. In addition, the segmentation experiment in Section 2.2 shows that the color homogeneity h (Equation (8)) provides useful information about the overall similarity of the three channels. As a result, we use 10 color attributes to describe color features: three moments for each of the three HSV channels plus the homogeneity h.
To summarize, the color moments are computed in the same HSV space as the region models used in the CBPT construction. This consistency provides a natural transition from BPT creation to color description and thereby combines the early and late fusion methods.
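The color attributes of a region can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the channel values are already scaled to [0, 1], that skewness is the standardized third central moment, and that the homogeneity h is computed elsewhere and passed in; all function names are hypothetical.

```python
def color_moments(values):
    """Mean, variance and skewness of one channel's pixel values in a region."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        # A constant region has no spread and no asymmetry.
        return mean, 0.0, 0.0
    third = sum((v - mean) ** 3 for v in values) / n
    return mean, var, third / var ** 1.5

def region_color_features(h_vals, s_vals, v_vals, homogeneity):
    """10 color attributes: 3 moments for each of the 3 channels plus the homogeneity."""
    feats = []
    for channel in (h_vals, s_vals, v_vals):
        feats.extend(color_moments(channel))
    feats.append(homogeneity)
    return feats
```

Note that hue is a circular quantity; this sketch treats it as a plain scalar for simplicity.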

Pattern Design and Structure Analysis
Visual patterns represent the re-occurring composition of visual attributes and capture the essence of an image, conveying rich information [19]. Because our CBPT is a bottom-up hierarchical structure, the spatial co-occurrences of image regions can contribute to better scene representation [53]. Furthermore, patterns in the BPT represent the relationships between different objects in HRS images. A containment relationship, such as a tree being part of a forest, is called Pattern P2. P3 is an extension of P2, such as an airport on an island, where the island is surrounded by water. The adjacency relationship, such as that between an island and its surrounding water, is called Pattern P4. Objects on the ground occupy positions within their environment and are linked to other objects; we design these co-occurrence patterns to analyze the distribution of ground objects. Based on the binary composition structure, we explore the 4 co-occurrence patterns [36] listed in Table 1.

Patterns                         Definition
Single region (P1)               R
Region-ancestor (P2)             R - R^r
Region-ancestor-ancestor (P3)    R - R^r - R^2r
Region-parent-sibling (P4)       R, its parent and its sibling

The 4 above-described patterns provide a dense collection of local features for the analyzed regions. The attributes of a region R are collected into a feature vector $f(R)$, and the local features of the 4 co-occurrence patterns are as follows, where $R^r$ is the r-th ancestor of R, $R^1$ is its parent and $R'$ is its sibling:
Features of P1: $f_1 = [f(R)]$;
Features of P2: $f_2 = [f(R), f(R^r)]$;
Features of P3: $f_3 = [f(R), f(R^r), f(R^{2r})]$;
Features of P4: $f_4 = [f(R), f(R^1), f(R')]$.
As a result, the patterns of the CBPT structure describe the image from different aspects, and the final description of an image is the concatenation of all patterns.
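The pattern features can be assembled by walking the tree, as in the following sketch. It assumes a hypothetical tree encoding that is not part of the paper: `parent` maps each node to its parent, `children` maps each internal node to its pair of children, and `feat` maps each node to its attribute list f(R). A pattern is reported as `None` when the tree is too shallow to define it for a given node.

```python
def ancestor(parent, node, r):
    """Return the r-th ancestor of node, or None if the root is passed first."""
    for _ in range(r):
        node = parent.get(node)
        if node is None:
            return None
    return node

def sibling(parent, children, node):
    """Return the other child of node's parent, or None for the root."""
    p = parent.get(node)
    if p is None:
        return None
    left, right = children[p]
    return right if node == left else left

def pattern_features(node, parent, children, feat, r=1):
    """Concatenate region features for patterns P1 to P4 of one node."""
    a1 = ancestor(parent, node, r)       # r-th ancestor
    a2 = ancestor(parent, node, 2 * r)   # 2r-th ancestor
    par = parent.get(node)
    sib = sibling(parent, children, node)
    f = feat[node]
    return {
        "P1": list(f),
        "P2": f + feat[a1] if a1 is not None else None,
        "P3": f + feat[a1] + feat[a2] if a2 is not None else None,
        "P4": f + feat[par] + feat[sib] if sib is not None else None,
    }
```

Calling `pattern_features` for every region whose area lies between the two thresholds would yield the dense collection of local features that is subsequently encoded.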

Color-Texture-Structure Descriptor Generation
After the texture and color features are extracted from all image regions whose size lies between $a_{min}$ and $a_{max}$, the resulting local features are numerous and redundant. To maximize classification accuracy while minimizing computational effort, we encode them into a more discriminative feature representation. Specifically, we explore two encoding strategies: locality-constrained linear coding (LLC) based on BOW [50] and the FV coding method [54].

Locality-Constrained Linear Coding
Locality-constrained linear coding (LLC) [50] is used to encode the local descriptors (color, texture and structure) into more discriminative descriptors. LLC uses locality constraints to project each descriptor into its local coordinate system, and the projected coordinates are integrated via max pooling to generate the final representation. We first use K-means to create the dictionary, with the number of cluster centers set to M. Then, LLC projects each descriptor in the descriptor set $F^p$ of pattern p onto its local dictionary by minimizing a locality-regularized reconstruction error [50], and the final pattern descriptor of an image is converted into a 1 × M code. More specifically, LLC performs a K-nearest-neighbor search and solves a small constrained least-squares fitting problem, which transforms the local descriptors into sparse codes. Multi-scale spatial pyramid max pooling over the sparse codes is subsequently used to obtain the final features.
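For reference, the LLC objective as it is usually stated in the LLC literature [50] can be written as below. The notation is an assumption made here for illustration and may differ from the authors' own: $f_t^p$ is the t-th local descriptor of pattern p, $B$ is the dictionary of M cluster centers, $c_t$ is the code of $f_t^p$, $d_t$ is the locality adaptor, $\odot$ is the element-wise product, and $\beta$ and $\sigma$ are tuning parameters.

```latex
\min_{C} \sum_{t=1}^{N} \left\| f_t^p - B\,c_t \right\|^2
       + \beta \left\| d_t \odot c_t \right\|^2
\quad \text{s.t.} \quad \mathbf{1}^{\top} c_t = 1,\ \forall t,
\qquad
d_t = \exp\!\left( \frac{\operatorname{dist}(f_t^p, B)}{\sigma} \right)
```

The locality adaptor penalizes codes that place weight on dictionary entries far from the descriptor, which is why the fast approximation only needs the K nearest entries.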

FV Coding
FV coding [54] uses a Gaussian mixture model (GMM) to approximate the distribution of low-level features and considers the mean as well as the variance. It characterizes the local features with a gradient vector derived from a probability density function. Let $F^p$ denote the set of local descriptors of pattern p, and let $u_\lambda$ be the density function of the GMM, whose parameter set $\lambda = \{w_m, \mu_m, \delta_m\}$ contains the weighting coefficient, mean and variance of each Gaussian, where $m = 1, 2, \cdots, M$ and M is the number of Gaussians, also called the dictionary size. The descriptors of pattern p are fit by maximizing the log-likelihood $L(F^p, \lambda)$. Based on Bayes' formula, the probability of a descriptor $f_t^p$ being generated by the i-th Gaussian is denoted by $\gamma_t(i)$. The FV $G(f, \lambda)$ is then computed as the concatenation of two gradient vectors, one with respect to the means and one with respect to the variances. Assuming each descriptor in $F^p$ has D dimensions, the FV has 2 × D dimensions for each Gaussian; therefore, the final descriptor has 2 × D × M dimensions. To reduce the feature dimension, we use PCA to compress the descriptor. The CTS descriptor is defined as $H = [h(P_1), h(P_2), h(P_3), h(P_4)]$.
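The following sketch illustrates the posterior $\gamma_t(i)$ and the 2 × D × M Fisher vector for a GMM with diagonal covariances. It follows the commonly used FV formulation rather than reproducing the paper's lost equations, assumes that the GMM parameters have already been estimated, and uses hypothetical function names; no protection against numerical underflow is included.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian at point x."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def posteriors(x, weights, means, variances):
    """gamma(i): probability that x was generated by the i-th Gaussian (Bayes' formula)."""
    joint = [w * gaussian_diag(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def fisher_vector(descriptors, weights, means, variances):
    """Concatenate mean and variance gradients; the output length is 2 * D * M."""
    T = len(descriptors)
    M, D = len(weights), len(means[0])
    g_mu = [[0.0] * D for _ in range(M)]
    g_var = [[0.0] * D for _ in range(M)]
    for x in descriptors:
        gamma = posteriors(x, weights, means, variances)
        for i in range(M):
            for d in range(D):
                z = (x[d] - means[i][d]) / math.sqrt(variances[i][d])
                g_mu[i][d] += gamma[i] * z
                g_var[i][d] += gamma[i] * (z * z - 1.0)
    fv = []
    for i in range(M):  # gradients with respect to the means
        fv += [g / (T * math.sqrt(weights[i])) for g in g_mu[i]]
    for i in range(M):  # gradients with respect to the variances
        fv += [g / (T * math.sqrt(2 * weights[i])) for g in g_var[i]]
    return fv
```

With D = 17 local attributes and M = 100 Gaussians, such a vector would have 3400 entries per pattern before any PCA compression.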

Experimental Results
We validate the performance of the CTS descriptor on two different datasets. The first is an object-based scene dataset: the 21-class UC Merced dataset [55], which was manually extracted from large images in the USGS National Map Urban Area Imagery collection for various urban areas around the United States. The pixel resolution of this public-domain imagery is approximately 0.30 m. The second dataset contains two large HRS scenes: a large scene of Tongzhou (Scene-TZ) [56] and a large scene near the Tucson airport (Scene-AT) [57], both captured by the GeoEye-1 satellite sensor. GeoEye-1 carries a high-resolution CCD camera that acquires images with a spatial resolution of up to 0.41 m in the panchromatic band and up to 1.65 m in the multi-spectral bands. In each experiment, we first introduce the dataset and experimental settings and then provide the results.
We use the 21-class dataset to test the coding methods and color features. To generate more discriminative high-level descriptors, we compare the FV coding method [54] with the classical BOW [58] and LLC [50] methods. To further demonstrate the effectiveness of our method, we compare the CBPT with the gray BPT and the topographic map and subsequently compare color moments with color names. Next, we compare our CTS descriptor with other popular descriptors, such as BOVW (bag of SIFT), SC+Pooling and bag of colors. The two large satellite scene experiments first present the classification results visually and then compare them with those of popular satellite image classification methods.
The final CTS descriptor is the histogram of all patterns based on color and texture. Therefore, the histogram intersection kernel (HIK) is an efficient way to measure the similarity between CTS descriptors. Compared to the linear, polynomial and radial basis function (RBF) kernels, the HIK-based support vector machine (SVM) achieves the best classification results for histogram-based descriptors [59]. In addition, the HIK SVM is widely used to compare BOW models. The kernel is defined as
$$K(H_i, H_j) = \sum_{k=1}^{4} \sum_{t=1}^{T_k} \min\left( h_i(P_k)[t],\, h_j(P_k)[t] \right)$$
where $h_i(P_k)[t]$ is the t-th bin of the histogram $h_i(P_k)$ and $T_k$ is the number of bins.
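A minimal sketch of the histogram intersection between two descriptors is given below. It assumes that each descriptor is stored as a list of per-pattern histograms with matching lengths; the function names are hypothetical.

```python
def hik(h_i, h_j):
    """Histogram intersection: sum of bin-wise minima over all pattern histograms."""
    return sum(min(a, b) for p_i, p_j in zip(h_i, h_j) for a, b in zip(p_i, p_j))

def gram_matrix(descriptors):
    """Pairwise HIK values, e.g. for use as a precomputed SVM kernel."""
    return [[hik(a, b) for b in descriptors] for a in descriptors]
```

Because the minimum is symmetric in its arguments, the resulting Gram matrix is symmetric, and the similarity of a descriptor with itself equals the sum of its bins.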

Experiments on UC Merced Dataset
The UC Merced dataset, a challenging object-based HRS image dataset, has been widely used in HRS scene classification. A total of 100 images measuring 256 × 256 pixels were manually selected for each of the following 21 classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Two typical examples of each class are shown in Figure 4.

To conduct the classification experiments, 80 images per class are randomly selected for training, and the remaining 20 images per class are retained for testing. This matches the experimental settings used by the popular descriptors on the UC Merced dataset (BOVW, SPM, bag of colors, etc.) and leaves 420 images (21 classes × 20 images) for testing. To ensure a fair comparison, we train an SVM classifier with the HIK on the training set and evaluate it on the remaining images; the parameters are set as recommended by the authors [59], and we follow the procedure in [20], in which five-fold cross-validation is performed: the dataset is first randomly split into five equal sets, and the classifier is then trained on four of the sets and evaluated on the held-out set. The average classification accuracy and standard deviation are computed over 200 evaluations.
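The five-fold protocol can be sketched as follows. This is an illustration under assumptions not stated in the paper: the split is stratified so that every fold holds out exactly one fifth of each class, samples are identified by hypothetical (class, index) pairs, and the seed parameter only serves reproducibility.

```python
import random

def five_fold_splits(n_per_class, n_classes, seed=0):
    """Yield (train, test) lists of (class, index) pairs for five folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(5)]
    for c in range(n_classes):
        idx = list(range(n_per_class))
        rng.shuffle(idx)
        for k in range(5):
            # Every fifth shuffled image of class c goes to fold k.
            folds[k] += [(c, i) for i in idx[k::5]]
    for k in range(5):
        test = folds[k]
        train = [s for j in range(5) if j != k for s in folds[j]]
        yield train, test
```

With 100 images for each of the 21 classes, each fold trains on 1680 images and tests on 420. Repeating the procedure with different seeds gives the repeated evaluations over which the mean and standard deviation are computed.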

Coding Method Comparison
To generate discriminative high-level features, we compare the FV representation [54] with the BOW and LLC representations. During the comparison of coding methods, the low-level local descriptors (7 shape features and 10 color features) are held constant; that is, the same collection of regions is used to generate the local descriptors. To ensure a fair comparison, the parameters of each method are tuned to obtain its best results. Table 2 shows the classification results of the three coding methods. Compared with the BOW model, which uses vector quantization (VQ) with average pooling [38], and with LLC, which uses max pooling [50], the FV coding method exhibits the best performance. LLC outperforms BOW because it uses both locality and sparsity constraints to project each descriptor onto its local coordinate system. FV coding produces substantially better results still because it models the local descriptors with a Gaussian mixture model (GMM), which requires a smaller dictionary while describing the data distribution better.

Fusion Strategy Comparison
To further illustrate the effect of the fusion strategy of the CTS descriptor, we compare it with the fusion of textures and color names via the color BPT, the fusion of textures and color moments via topographic maps [25], and the late fusion of textures and color names via topographic maps. Shape-based texture analysis on topographic maps has achieved good results [36], and we want to determine whether it continues to perform well in HRS image classification; to account for color information, we perform a late fusion of its textures with color moments. Because the BPT differs substantially from topographic maps in its construction, we also create a BPT based on gray-level images.
We first use only texture descriptors to perform classification based on the three hierarchical structures: topographic maps, gray BPT and CBPT. Table 3 shows that the classification results based on the CBPT are the best, illustrating that a structure built with color information is more discriminative and that the CBPT is superior in describing texture features. Color moments are then added to the descriptors, and the method based on the CBPT again achieves the best results, performing slightly better than the late fusion of color moments and texture based on topographic maps. In addition, because color names perform well in object detection [33], we also fuse color names with the texture descriptors; however, the results show that color moments perform better in combination with our CBPT construction method.
Moreover, several parameters affect the classification results, including (1) the minimum size of the regions and (2) the dictionary size of each pattern. We analyze the effects of these parameters on the classification results and then choose suitable values. The minimum region size determines which regions are used to compute local features: only regions larger than this threshold are considered. A larger minimum region size makes feature extraction faster. However, an overly large minimum region size dramatically decreases performance, because the texture-color cues are local statistics of the image and large regions quickly destroy the locality of the representation. Thus, the minimum region size should be chosen as a trade-off between computational efficiency and the discriminative power of the CTS descriptor. We test a series of minimum region sizes, and the classification results are listed in Table 4. The table shows that different trees have different optimal minimum region sizes: 6 is a good choice for topographic maps, whereas 15 is the best for the CBPT. Note that our classification results are averaged over 200 repeated experiments, which reduces the influence of random factors. Table 4 also shows that our method is robust to the minimum region size within a reasonable range. Table 5 shows that a suitable dictionary size is 100, where the dictionary size is the number of cluster centers. Because we randomly select a limited number of images (10 images) to learn the dictionary, the classification results may vary slightly; we therefore repeat the experiment 5 times (i.e., with different samples selected to learn the dictionary) and report the average accuracy and the standard deviation.

Classification Result Comparison
We extract the 17 local features based on the CBPT, and the minimum size of the regions used for feature extraction is set to 15 pixels. The dictionary size of the FV encoding is 100. Table 6 shows that, with these settings, the CTS descriptor outperforms the state-of-the-art algorithms on the UC Merced dataset. The BOVW (bag of SIFT) algorithm is a high-level feature that encodes SIFT descriptors [60]; the local SIFT descriptor is very discriminative but not sufficiently semantic. The bag of colors method uses only color information, and its regions are not semantic. This comparison demonstrates the advantage of our descriptor, which combines a hierarchical region-based method with color and texture fusion. Furthermore, we compare the CTS descriptor with the latest well-designed methods. HMFF, which relies on hand-crafted, carefully designed features and on a hierarchical fusion of multiple features, achieves a classification accuracy comparable with ours. The unsupervised feature learning method UFL-SC uses a low-dimensional image patch manifold learning technique and focuses on effective dictionary learning and feature encoding, which provides an alternative way of analyzing local features; however, its classification accuracy is lower than that of the CTS descriptor.

Methods               Classification Results (%)
BOVW [20]             71.86
SPM [38]              74
SC+Pooling [61]       81.67 ± 1.23
Bag of colors [62]    83.46 ± 1.57
COPD [63]             91.33 ± 1.11
HMFF [62]             92.38 ± 0.62
UFL-SC [64]           90.26 ± 1.51
SAL-LDA [65]          88.33 ± 1.15
CTS                   93.08 ± 1.13
More precisely, we analyze the confusion matrix of the classification results based on the CTS descriptor. To obtain more stable results, we repeat the classification experiments 200 times. The confusion vector of class i is then defined, for each class j, as the ratio $sum_i(j) / sum_i(samples)$, where $sum_i(j)$ is the number of images that belong to class i but are misclassified as class j and $sum_i(samples)$ is the number of testing samples in class i.
Figure 5 displays the confusion matrix of the CTS descriptor on the UC Merced dataset. As observed in the confusion matrix, some scenes are confused with one another. The tennis court class presents the greatest confusion, because its color and texture information is easily confused with that of the baseball diamond, buildings, dense residential, intersection, medium-density residential, sparse residential and storage tanks classes. The overpass and freeway classes are also likely to be confused, with the misclassification rate reaching 7.3%, because an overpass is part of a freeway and the two cannot be separated using texture and color information alone.
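The confusion vectors can be computed from the true and predicted labels as in the sketch below. It assumes integer class labels starting at 0; the function name is hypothetical. The diagonal entry of each row is the fraction of correctly classified samples of that class.

```python
def confusion_vectors(true_labels, predicted_labels, n_classes):
    """Row i holds, for each class j, the fraction of class-i samples predicted as j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    rows = []
    for row in counts:
        total = sum(row)  # number of testing samples in this class
        rows.append([c / total if total else 0.0 for c in row])
    return rows
```

Averaging these rows over repeated experiments gives a more stable confusion matrix.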


Experiments on Large Satellite Scenes
To further demonstrate the discriminative ability and robustness of the CTS descriptor, we apply our descriptor to two large satellite scene images.

Experiments on Scene-TZ
Scene-TZ [56] is a 4000 × 4000-pixel HRS scene taken over Majuqiao Town in the southwest of Tongzhou District, Beijing. The original image and its geographic location are shown in Figure 6. There are 8 semantic classes in Scene-TZ: bare land, low buildings, factories, high buildings, farmland, green land, road and water, where each class has some similar texture and color information. We show one sample per class in Figure 7a, and the hand-labeled ground reference data are shown in Figure 7b.
First, we divide the large satellite image into non-overlapping sub-images with a size of 100 × 100 pixels. As a result, Scene-TZ is divided into 1600 patches. To assess the classification results, we label each patch with its corresponding semantic category. Unlike the random selection of training samples used in object categorization, we manually select 10 typical samples for each class as the training set for large satellite image scene classification; these samples are used as seeds to classify all other patches. The reason is that randomly chosen training samples would be uniformly distributed over the entire image, and because the whole image may be characterized by inhomogeneity, distributing the training samples uniformly over the whole image simplifies the problem. Moreover, end users often manually select some typical samples from the image for each class, e.g., in ENVI and e-Cognition. It would be preferable to use completely independent images for training and testing to observe the robustness of the CTS descriptor. In addition, we ensure that the training samples stay the same when applied to other state-of-the-art methods.
Table 7 shows the classification results on Scene-TZ. We perform a comparison with several features combining color, texture and structure information: (1) OF, the basic feature concatenation of SIFT [60], CS [66], and BOC [62]; (2) EP [67], with features learned via unsupervised ensemble projection of SIFT, CS and BOC; and (3) SSEP [56], with features learned via semi-supervised ensemble projection of SIFT, CS and BOC. From Table 7, we can observe that our CTS descriptor outperforms all the other features. With the same training samples, the CTS descriptor improves on SSEP by approximately 10.85%. In addition, to exclude the effect of different classifiers, we also pair the CTS descriptor with a logistic regression (LR) classifier, which is the classifier used in OF, EP and SSEP. The classification results based on logistic regression are poorer than the results based on HIK SVM but are still substantially better than the results obtained using SSEP.
Figure 7 shows the classification results of each feature. Overall, the CTS descriptor provides the best visual result. Road and farmland are almost all classified correctly because of their shape features and uniform color. Nevertheless, some misclassified patches remain. This can be explained as follows. First, the terrain is complex, and it is not possible to discern variation with absolute precision. Second, a 100 × 100 patch cannot realistically contain only one category. When labeling a patch, we mark it as the class with the largest weight. This is why there are some misclassifications at the boundary of two classes. To further analyze the classification results, we use the confusion matrix illustrated in Figure 8. Based on the visual results, water, road and farmland achieve the best results; city green land is mixed with farmland; and factories are mixed with high buildings.
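The tiling and patch-labeling steps used for the large scenes can be sketched as follows. This is a minimal illustration that assumes "the class with the largest weight" means the class covering the most pixels in a patch; the function names and the toy label grid are hypothetical.

```python
from collections import Counter

def tile_origins(height, width, patch=100):
    """Top-left corners of non-overlapping patch x patch tiles.

    Partial tiles at the right and bottom edges are dropped.
    """
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]

def majority_label(label_grid, row, col, patch=100):
    """Label a tile with the class that covers the most pixels in it.

    Assumption: 'largest weight' is read as largest pixel share.
    """
    counts = Counter(label_grid[r][c]
                     for r in range(row, row + patch)
                     for c in range(col, col + patch))
    return counts.most_common(1)[0][0]

# Patch counts for the two scenes with 100 x 100 tiles.
n_tz = len(tile_origins(4000, 4000))   # Scene-TZ
n_ta = len(tile_origins(4500, 4500))   # Scene-TA

# Toy 4 x 4 reference map, tiled into 2 x 2 patches.
grid = [["road",  "road",  "farm", "farm"],
        ["road",  "farm",  "farm", "farm"],
        ["water", "water", "farm", "road"],
        ["water", "water", "road", "road"]]
labels = [majority_label(grid, r, c, patch=2)
          for r, c in tile_origins(4, 4, patch=2)]
print(n_tz, n_ta, labels)
```

The top-left and bottom-right toy patches each contain two classes, which mirrors the boundary effect discussed above: the minority class in a mixed patch is necessarily counted as an error.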

Experiments on Scene-TA
The purpose of the experiment on Scene-TA is to further verify the generalizability of our CTS descriptor to HRS images. Scene-TA was acquired by GeoEye-1 in 2010, near an airport in Tucson, Arizona, USA. The original image and geographic location are shown in Figure 9. Scene-TA is 4500 × 4500 pixels and contains 7 main semantic regions: water, buildings 1, buildings 2, buildings 3, dense grassy land, bare land, and sparse grassy land. Figure 10a shows an example of each class and Figure 10b shows the hand-labeled ground reference data.
In accordance with the previous experimental settings, the primitive patch contains 100 × 100 pixels. The entire image consists of 2025 patches. In addition, we manually select 10 samples per class as the training set, and the remaining samples are the testing set. To ensure a fair comparison, we use the same training samples in other methods.
Figure 10 shows the classification results on Scene-TA. Based on the ground reference data, we observe clear boundaries and obvious differences in color and texture. Thus, the CTS descriptor achieves a good classification result, and the classification accuracy reaches 78.62%. Table 8 shows a comparison with other methods. The direct concatenation of SIFT, BOC and CS (OF) is less discriminative than the features learned by ensemble projection and semi-supervised ensemble projection, while semi-supervised ensemble projection achieves the best result among the methods built on the local features SIFT, CS and BOC. When using the combination of local features based on CBPT, the CTS descriptor achieves a better result with the LR classifier. Due to the full utilization of the spatial multi-scale characteristics and the topological relationships of objects, the CTS descriptor suffers from fewer misidentifications. Compared to the ground reference data, several sparse grassy land patches are misclassified as bare land because the bare land contains a few scattered areas of grass, which can also be observed in the confusion matrix shown in Figure 11. Ideally, the patches labeled as water can all be correctly classified because of their unique dark color and smooth texture.


Discussion
HRS image classification plays an important role in understanding remotely sensed imagery. In this paper, we build a multi-scale spatial representation and analyze the color, texture and structure information of an HRS image. Our objective is to design a discriminative color-texture-structure (CTS) descriptor for high-resolution image classification. The experimental results on the UCM-21 dataset and two large satellite images indicate that the proposed CTS descriptor outperforms state-of-the-art methods.
The construction of the CBPT plays a vital role in our algorithm. The region model of the CBPT affects the robustness and discrimination of our final descriptor. Moreover, the computational complexity of the CBPT affects the efficiency of our method. As described in Section 2, our CBPT merges the original pixels, and the small regions are not sufficiently semantic. Furthermore, assuming that there are N nodes in level n, the total number of nodes of all upper levels will be less than N. Therefore, smaller regions result in fewer levels, and the number of regions will increase exponentially. Thus, we can choose more semantic regions, such as superpixels generated by over-segmentation methods, as leaf nodes [68].
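The node-count argument can be checked with a small sketch. It assumes a strictly binary merging process in which every step fuses two regions into one parent, so N leaves always produce N - 1 parent nodes. The 256 × 256 image size and the figure of 1000 superpixels are illustrative assumptions, not values from the experiments.

```python
def count_merges(n_leaves):
    """Count the parent nodes created when n_leaves regions are merged pairwise."""
    regions = n_leaves
    merges = 0
    while regions > 1:
        regions -= 1   # two regions are replaced by their union
        merges += 1    # one new parent node is added to the tree
    return merges

# Leaves = individual pixels of a hypothetical 256 x 256 image.
pixel_leaves = 256 * 256
pixel_tree_nodes = pixel_leaves + count_merges(pixel_leaves)

# Leaves = a hypothetical over-segmentation into 1000 superpixels.
superpixel_leaves = 1000
superpixel_tree_nodes = superpixel_leaves + count_merges(superpixel_leaves)

print(pixel_tree_nodes, superpixel_tree_nodes)
```

Under these assumptions the pixel-based tree holds 131071 nodes against 1999 for the superpixel-based tree, which illustrates why starting from an over-segmentation reduces the cost of building and storing the tree.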
The region-based feature extraction method relies on setting a proper threshold for the region size. The size of a semantic region varies with the image resolution; therefore, choosing an appropriate threshold is not a trivial task. Furthermore, the parameters of the co-occurrence patterns also affect the results, because the distance between a region and its ancestor influences their similarity. Short distances result in redundancy, whereas large distances result in low discrimination.
Color, texture and structure features characterize HRS images from three different aspects. Because they are complementary in terms of image description, descriptors based on an efficient combination of the three cues should be more discriminative. For certain categories, each feature channel is discriminative on its own, e.g., the beach in the UC Merced dataset, which results in minimal confusion with other categories. However, the city green land and farmland in Scene-TZ exhibit homogeneity in terms of color and texture but heterogeneity in terms of structure, which is emphasized by the compact rectangular shape.
The proposed CTS descriptor achieves good classification results on several HRS image datasets. As an object-based image analysis method, the CBPT representation fully considers the multi-scale property and topological relationships of objects in HRS images. Furthermore, we present an efficient combination of early and late fusion of color and texture based on CBPT. There are many feature fusion methods in the literature [1][2][3][4][5], most of which are characterized by late fusion of color and texture; i.e., the multiple cues are combined in the classification process. In particular, CTS implements an efficient combination of color, texture and structure based on CBPT, and achieves the early fusion of local regions and late fusion in the classification process.
However, this descriptor suffers from certain limitations. As previously mentioned, the region model is a very important concept in BPT construction. In this work, the merge criterion of the CBPT is based on the Euclidean distance in the HSV color space, and we choose the average of the three channels as the region model. However, when the region size is large, this choice is not optimal because substantial amounts of information are lost by the averaging procedure. Our future work will concentrate on a more semantic region model and similarity criterion. In particular, the building process of the CBPT representation usually involves calculating three types of dissimilarity distances: pixel to pixel, pixel to region and region to region. A unified and robust dissimilarity distance for these three cases is desired.
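The region model and merge criterion described above can be sketched in a few lines. This is an illustration only: it assumes HSV values scaled to [0, 1], treats hue as an ordinary linear channel (its circular nature is ignored), and uses hypothetical function names and pixel values.

```python
import math

def region_model(pixels):
    """Region model: the mean of each HSV channel over the region's pixels."""
    n = len(pixels)
    return tuple(sum(p[k] for p in pixels) / n for k in range(3))

def merge_distance(region_a, region_b):
    """Merge criterion: Euclidean distance between the two region models."""
    return math.dist(region_model(region_a), region_model(region_b))

# Three toy regions given as lists of (h, s, v) pixels.
a = [(0.10, 0.50, 0.60), (0.12, 0.50, 0.60)]
b = [(0.11, 0.50, 0.60)]
c = [(0.60, 0.20, 0.90)]

# The most similar pair (a, b) would be merged first.
d_ab = merge_distance(a, b)
d_ac = merge_distance(a, c)

# A large mixed region averages to a color that none of its pixels has.
mixed = region_model([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])
print(d_ab, d_ac, mixed)
```

The last line illustrates the limitation noted in the text: once two very different areas are merged, the mean-based model no longer represents either of them, which motivates a richer region model for large regions.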

Conclusions
In this paper, a region-based color-texture-structure descriptor, i.e., the CTS descriptor, has been proposed to classify HRS images via a hierarchical color binary partition tree structure. The main contribution of the CTS descriptor is the use of CBPT to analyze color and texture information, which specifically combines the early and late fusion methods of cues and analyzes the co-occurrence patterns of several objects. The efficiency of the proposed method is substantiated by classification experiments on the 21-class UC Merced dataset and on two large satellite images. Both qualitative and quantitative analyses confirmed the improved performance of the proposed CTS descriptor compared with several other approaches. By defining the initial partition of the merging process on an over-segmentation result, i.e., a super-pixel partition, the computational and memory costs of BPT generation can be drastically reduced. Thus, the proposed CTS descriptor can be easily extended to process and analyze very large images. In the future, we intend to explore more semantically meaningful region models using super-pixel partition based initialization and more discriminative visual patterns in BPT representation.

Figure 1. Flowchart of high-resolution satellite (HRS) image classification based on the CTS descriptor.

Figure 2. Schematic map of the Binary Partition Tree (BPT) construction. (a) Original image with 4 regions; (b) The construction of the BPT.

Figure 3. The segmentation of the HRS image via Color Binary Partition Tree (CBPT). (a) An HRS airport image; (b) Segmentation results at multiple scales.

$F^p = [f^p_1, f^p_2, \cdots, f^p_N]$ is a set of pattern descriptors extracted from the CBPT of an image, $B$ is the dictionary, $C = [c^p_1, c^p_2, \cdots, c^p_N]$ is the set of coefficients for fitting $F$, $\odot$ denotes elementwise multiplication, and $d_i$ is the locality adaptor, with $d_i = \exp\left(\mathrm{dist}(f^p_i, B^p) / \sigma\right)$.

$F^p = [f^p_1, f^p_2, \cdots, f^p_N]$, fitting $F^p$ with a probabilistic model $p(F, \Theta)$ and representing data with the derivative of the data's log-likelihood.
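The locality adaptor can be illustrated with a short sketch. It assumes that dist returns the vector of Euclidean distances between one descriptor and every dictionary atom, as in the usual locality-constrained coding formulation; the function name, the toy dictionary and the value of σ are hypothetical.

```python
import math

def locality_adaptor(descriptor, dictionary, sigma=1.0):
    """Return d_i = exp(dist(f_i, B) / sigma), one weight per dictionary atom.

    Assumption: dist is the Euclidean distance to each atom, so atoms far
    from the descriptor receive a larger penalty weight.
    """
    distances = [math.dist(descriptor, atom) for atom in dictionary]
    return [math.exp(d / sigma) for d in distances]

# Toy 2-D descriptor and a two-atom dictionary.
f_i = (0.0, 0.0)
B = [(0.0, 0.0), (3.0, 4.0)]
weights = locality_adaptor(f_i, B, sigma=5.0)
print(weights)
```

Because the weights multiply the coding coefficients elementwise in the penalty term, a larger weight discourages the use of the corresponding atom, so each descriptor tends to be encoded by its nearest atoms.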

To conduct the classification experiments, the number of randomly selected training samples per class is set to 80 images, and the remaining 20 samples are retained for testing. Following the experimental settings used by popular descriptors on the UC Merced dataset (BOVW, SPM, Bag of colors, etc.), 420 images remain for testing. To ensure a fair comparison, we use the training set

Figure 4. Two typical examples of each class of the 21-class UC Merced dataset.
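The per-class split used for the UC Merced experiments can be sketched as follows. This is a minimal illustration with placeholder file names; the fixed seed and the function name are assumptions made for reproducibility of the example, not settings reported in the paper.

```python
import random

def split_per_class(samples_by_class, n_train=80, seed=0):
    """Randomly pick n_train samples per class for training; the rest are for testing."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, items in samples_by_class.items():
        shuffled = list(items)   # copy so the input is left untouched
        rng.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test

# 21 classes with 100 images each (placeholder names).
dataset = {
    f"class_{k:02d}": [f"class_{k:02d}_img_{i:03d}" for i in range(100)]
    for k in range(21)
}
train, test = split_per_class(dataset)
n_train_total = sum(len(v) for v in train.values())
n_test_total = sum(len(v) for v in test.values())
print(n_train_total, n_test_total)
```

With 21 classes and 20 held-out images per class, the test set contains 420 images, which matches the figure quoted in the text.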

Figure 5. Confusion matrix for the descriptors based on the CTS descriptor on the UC Merced dataset.

Figure 6. (a) The original image of Scene-TZ; (b) The geographic location of Scene-TZ.

Figure 7. Classification result on the Scene-TZ Dataset. (a) A typical sample of each class in the 8-class Scene-TZ; (b) Ground reference data; (c) OF result; (d) EP result; (e) SSEP result; (f) CTS result.

Figure 8. Confusion matrix of the classification result based on the CTS on Scene-TZ.

Figure 9. (a) The original image of Scene-TA; (b) The geographic location of Scene-TA.

Figure 10. Classification result on the Scene-TA Dataset. (a) The original image; (b) Ground reference data; (c) OF result; (d) EP result; (e) SSEP result; (f) CTS result.

Figure 11. Confusion matrix for the classification results based on the CTS descriptor on Scene-TA.

Table 2. The results obtained using three different coding methods on local descriptors.

Table 3. Comparison of different color features on BPT.

Table 4. Classification results under different minimum region sizes on the UCM dataset.

Table 5. Classification results under different dictionary sizes on the UCM dataset.

Table 6. Classification result comparison on UC Merced dataset.

Table 7. Classification accuracy comparison on Scene-TZ with ten training samples per class.

Table 8. Classification accuracy comparison on Scene-TA with ten training samples per class.