The rapid progress in remote sensing imaging techniques over the past decade has produced an explosive amount of remote sensing (RS) satellite images with different spatial resolutions and spectral coverage. This allows us to potentially study the ground surface of the Earth in greater detail. However, it remains extremely challenging to extract useful information from the large number of diverse and unstructured raw satellite images for specific purposes, such as land resource management and urban planning [1,2,3,4]. Understanding the land on Earth using satellite images generally requires extracting a small sub-region of an RS image for analysis and exploring its semantic category. The fundamental procedure for classifying satellite images into semantic categories first involves extracting effective features for image representation and then constructing a classification model from manually-annotated labels and the corresponding satellite images. The success of the bag-of-visual-words (BOW) model [5,6,7] and its extensions for general object and natural scene classification has resulted in the widespread application of these models to the semantic category classification problem in the remote sensing community. The BOW model was originally developed for text analysis and was then adapted to represent images by the frequency of "visual words", which are generally learned from pre-extracted local features via a clustering method such as K-means [5]. In order to reduce the reconstruction error caused by approximating a local feature with only one "visual word" in K-means, several variant coding methods such as sparse coding (SC) and locality-constrained linear coding (LLC) [
8,9,10,11] and the Gaussian mixture model (GMM) [12,13,14,15] have been explored in the BOW model to improve the reconstruction accuracy of local features, and some researchers have further endeavored to integrate the spatial relationships of the local features. On the other hand, local features such as SIFT [
16], which are handcrafted and designed as gradient-weighted orientation histograms, are generally used as-is, despite their strong effect on the performance of these BOW-based methods [17,18,19]. Therefore, some researchers have investigated learning local features automatically from a large number of unlabeled RS images via unsupervised learning techniques instead of using handcrafted local feature extraction [
20,21,22], thereby improving the classification performance to some extent. Recently, deep learning frameworks have witnessed significant success in general object and natural scene understanding [23,24,25] and have also been applied to remote sensing image classification [
26,27,28,29,30]. These frameworks perform impressively compared with the traditional BOW model. All of the above-mentioned algorithms first explore the spatial structure to provide the local features, which is important for local structure analysis in high-definition general images, in which a single pixel covers several centimeters or millimeters. However, the available satellite images are usually acquired at a ground sampling distance of several meters, e.g., 30 m for Landsat 8 and 1 m even for high-definition satellite images from the National Agriculture Imagery Program (NAIP) dataset [31]. Thus, the spatial resolution of a satellite image is much lower than that of a general image, and spatial analysis of nearby pixels, which often belong to different categories in a satellite image, may not be suitable. Recently, Zhong et al. [32] proposed an agile convolutional neural network (CNN) architecture, named SatCNN, for high-spatial-resolution RS image scene classification, which uses smaller kernel sizes to build an effective CNN architecture and has demonstrated promising performance.
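The BOW pipeline referenced above can be sketched in a few lines. This is a minimal illustration, not any cited system: random vectors stand in for local descriptors (e.g., 128-D SIFT), K-means learns the visual vocabulary, and hard assignment produces a word-frequency histogram as the image representation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for local descriptors (e.g., 128-D SIFT) pooled from training images.
train_descriptors = rng.normal(size=(1000, 128))

# Learn the visual vocabulary ("visual words") by K-means clustering.
n_words = 64
kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(train_descriptors)

def bow_histogram(descriptors: np.ndarray) -> np.ndarray:
    """Hard-assign each local descriptor to its nearest visual word and
    represent the image by the normalized word frequencies."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()

image_descriptors = rng.normal(size=(200, 128))  # descriptors of one test image
h = bow_histogram(image_descriptors)
```

The variant coding methods mentioned above (SC, LLC, GMM) replace the hard assignment in `bow_histogram` with a soft or sparse reconstruction over several visual words.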
On the other hand, despite its low spatial resolution, a satellite image is usually acquired in multiple spectral bands (also known as hyper-spectral data), which makes pixel-wise land cover investigation possible even with mixed pixels. Considerable effort has been devoted to the traditional mixed-pixel recovery problem (known as the unmixing model) [33,34,35]. This model can obtain material composition fraction maps and a set of spectra of pure materials (also known as endmembers) and has achieved acceptable pure pixel recovery results. These pixel-wise methods assume that the input images contain pure endmembers, and they can process images with mixed pixels of several or dozens of endmembers. This study aims to classify a small sub-region of a satellite image into a semantic category by considering that a pixel spectrum in the explored sub-region is a superposition of several spectral prototypes (possible endmembers). At the same time, because of the large variety of multi-spectral pixels even for the same material due to environmental changes, we generate an over-complete spectral prototype set (dictionary or codebook), meaning that the number of prototypes is larger than the number of spectral bands; this also accounts for the variety of multi-spectral pixels of the same material, whereas most optimization methods for simplex (endmember) identification [
36,37,38,39] in an unmixing model generally obtain only a sub-complete prototype set, thereby possibly ignoring some detailed spectral structures for representation. Therefore, based on the learned over-complete spectral codebook, any pixel spectrum can be well reconstructed by a linear combination of only a few spectral prototypes to produce a sparse coded vector. Furthermore, deciding how to aggregate the sparse coded spectral vectors for the sub-region representation is a critical step that affects the final recognition performance. In the conventional BOW model and its extensions with spatially-analyzed local features, the coded vectors in an image are generally aggregated with an average or max pooling strategy. Average pooling simply takes the mean value of the coded coefficients corresponding to a learned visual word and is typically used with hard assignment (i.e., representing any local feature using only one visual word), whereas max pooling takes the maximum value over all coded coefficients in an image or region corresponding to a learned visual word (atom) and is typically used with soft-assignment or sparse coding approaches. The max pooling strategy combined with sparse coding has achieved promising performance in the classification and detection of different objects, which means that exploiting only the highest activation status of a local description prototype (possibly a distinct local structure in an object with spatial analysis) is effective. However, the max pooling strategy retains only the strongest activated pattern and completely ignores the frequency of the activated patterns, an important signature for identifying different types of images. In addition, because of the low spatial resolution of satellite images, exploring spatial analysis together with pixel-wise spectral analysis to provide the composition fraction of any spectral prototype would be unsuitable.
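The frequency argument above can be made concrete with a toy example (the coefficient values are hypothetical): in a five-pixel sub-region, one atom fires once but strongly, while another fires weakly but in four of the five pixels.

```python
import numpy as np

# Toy sparse codes: rows are pixels of a sub-region, columns are
# spectral prototypes (atoms) of the learned dictionary.
codes = np.array([
    [0.9, 0.0],   # atom 0: a single strong activation (possibly an outlier)
    [0.0, 0.3],
    [0.0, 0.3],
    [0.0, 0.3],
    [0.0, 0.3],   # atom 1: weak but frequent activations
])

max_pooled = codes.max(axis=0)   # keeps only the strongest response per atom
avg_pooled = codes.mean(axis=0)  # frequency-aware, but diluted by outliers/zeros
```

Max pooling ranks atom 0 far above atom 1 even though atom 1 is activated in four times as many pixels, while average pooling ranks atom 1 higher; neither view alone captures both the magnitude and the frequency of the activations.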
We aim to obtain the statistical fractions of each spectral prototype to represent the explored sub-region, whereas max pooling unavoidably ignores almost all of the coded spectral coefficients, while average pooling may incorporate the coded spectral coefficients of outliers into the final representation. Therefore, this study proposes a hybrid aggregation (pooling) strategy for the sparse coded spectral vectors that integrates not only the maximum magnitude but also the response magnitudes of the relatively large coded coefficients of a specific spectral prototype, a process named K-support pooling. This proposed hybrid pooling strategy generalizes the popularly-applied average and max pooling methods and, rather than overly emphasizing the maximum activation, prefers a group of activations in the explored region. The proposed satellite image representation framework is shown in Figure 1, where the top row illustrates over-complete spectral prototype set learning, and the bottom row shows the sparse coding of each pixel spectrum and the hybrid pooling of all coded spectral vectors in a sub-region to form the discriminative feature.
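One plausible reading of the hybrid strategy just described is to average, for each prototype, the K largest coefficient magnitudes. The sketch below follows that assumption (the paper's exact formulation is given later, in Section 3); with K=1 it reduces to max pooling, and with K equal to the number of pixels it reduces to average pooling of magnitudes.

```python
import numpy as np

def k_support_pool(codes: np.ndarray, k: int) -> np.ndarray:
    """Pool sparse codes (pixels x atoms): for each atom, average the
    k largest coefficient magnitudes.  k=1 reduces to max pooling of
    magnitudes; k = number of pixels reduces to average pooling."""
    mags = np.abs(codes)
    topk = np.sort(mags, axis=0)[::-1][:k]  # k largest entries per column
    return topk.mean(axis=0)

codes = np.array([[0.9, 0.0],
                  [0.0, 0.3],
                  [0.0, 0.3],
                  [0.0, 0.3],
                  [0.0, 0.3]])
pooled = k_support_pool(codes, k=2)  # between max- and average-pooled values
```

Intermediate values of K thus retain the strongest response while still reflecting how often a prototype is activated across the sub-region.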
Because of the low spatial resolution of satellite images, this study explores spectral analysis instead of the spatial analysis widely used in general object and natural scene recognition. The main contributions of our work are two-fold: (1) unlike the spectral analysis in the unmixing model, which usually obtains only a sub-complete basis (the number of bases is smaller than the number of spectral bands) via simplex identification approaches, we investigate an over-complete dictionary for more accurate reconstruction of any pixel spectrum and obtain the reconstruction coefficients by using a sparse coding technique; (2) we generate the final representation of a satellite image from all sparse coded spectral vectors, for which we propose a generalized aggregation strategy that integrates not only the maximum magnitude but also the response magnitudes of the relatively large coded coefficients of a specific spectral prototype, instead of employing the conventional max and average pooling approaches.
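Contribution (1) can be sketched with an off-the-shelf dictionary learner; scikit-learn's `DictionaryLearning` here stands in for the paper's own learning procedure, and the band count, atom count, sparsity level, and random spectra are illustrative assumptions only.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

n_bands, n_atoms = 8, 32              # over-complete: far more atoms than bands
pixels = rng.random((500, n_bands))   # stand-in for multi-spectral pixel spectra

# Jointly learn the over-complete spectral dictionary and the sparse codes;
# each pixel spectrum is reconstructed from at most 4 spectral prototypes.
dl = DictionaryLearning(n_components=n_atoms,
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=4,
                        max_iter=20, random_state=0)
codes = dl.fit_transform(pixels)      # (500, 32) sparse coefficient vectors
dictionary = dl.components_           # (32, 8) learned spectral prototypes
```

Each row of `codes` is the sparse coded vector of one pixel spectrum; the pooling strategy of contribution (2) then aggregates these rows over a sub-region into its final representation.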
This paper is organized as follows. Section 2 describes related work, including the BOW model based on spatial analysis and the multi-spectral unmixing problem, which assumes a limited number of bases (endmembers) and the corresponding abundance for each spectral pixel. The proposed strategy, which entails sparse coding for the multi-spectral representation of pixels, is introduced in Section 3, together with a generalized aggregation approach for the coded spectral vectors. The experimental results and discussion are provided in Section 4. Finally, concluding remarks are presented in Section 5.