1. Introduction
Remote sensing has received unprecedented attention due to its role in mapping land cover [1], geographic image retrieval [2], detection of natural hazards [3], and monitoring changes in land cover [4]. The currently available remote sensing satellites and instruments for observing the Earth (e.g., IKONOS, unmanned aerial vehicles (UAVs), and synthetic aperture radar) not only provide high-resolution scene images but also offer an opportunity to study spatial information at a fine-grained level of detail [5].
However, within-class diversity and between-class similarity among scene categories are the main challenges that make it extremely difficult to distinguish scene classes. For instance, as shown in Figure 1a, a large intra-class (within-class) diversity can be observed: the resort scenes appear with different building styles, yet all of them belong to the same class. Similarly, the park scenes show large differences within the same semantic class. In addition, satellite imagery can be influenced by differences in color or radiation intensity due to factors such as weather, cloud coverage, and mist, which in turn may cause within-class diversity [6,7]. In terms of inter-class (between-class) similarity, the challenge is caused by the appearance of the same ground objects within different scene classes, as illustrated in Figure 1b. For instance, stadium and playground are different classes but exhibit a high degree of semantic overlap. This motivates us to focus on multi-level spatial features with small within-class scatter but large between-class separation. Here, the “scenes” are different types of subareas extracted from large satellite images. These subareas may correspond to different types of land cover or objects and possess a specific semantic meaning, such as commercial area, dense residential, sparse residential, and parking lot in a typical urban area of a satellite image [6]. With the development of modern technologies, scene classification has become an active research field, yet correctly labeling a scene with a predefined class is still a challenging task.
In the early days, most approaches focused on hand-crafted features, which can be computed from shape, color, or texture characteristics; commonly used descriptors are local binary patterns (LBPs) [8], the scale-invariant feature transform [9], color histograms [10], and histograms of oriented gradients (HOG) [11]. A major shortcoming of these low-level descriptors is their inability to support scene understanding due to the high diversity and non-homogeneous spatial distributions of scene classes. In comparison to handcrafted features, the bag-of-words (BoW) model is one of the best-known mid-level (global) representations, which became extremely popular in image analysis and classification [12] while providing an efficient solution for aerial and satellite image scene classification [13]. It was first proposed for text analysis and then extended to images through the spatial pyramid matching (SPM) method, because the vanilla BoW model does not consider spatial and structural information. Specifically, the SPM method divides an image into several parts and computes BoW histograms from each part based on the structure of the local features. The histograms from all image parts are then concatenated to make up the final representation [14]. Although these mid-level features are highly efficient, they may not be able to characterize detailed structures and distinct patterns. For instance, some scene classes are represented mainly by individual objects, e.g., runway and airport in remote sensing datasets. As a result, the performance of the BoW model remains limited when dealing with complex and challenging scene images.
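As a minimal illustration of this SPM-style pipeline (not the exact configuration used in this paper), the sketch below assumes dense local descriptors with known image positions and a k-means codebook fitted beforehand; all function and variable names are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def spm_histogram(descriptors, positions, image_shape, codebook, levels=(1, 2, 4)):
    """Concatenate BoW histograms computed over an SPM-style spatial grid.

    descriptors: (n, d) local descriptors; positions: (n, 2) row/column locations;
    codebook: a fitted sklearn KMeans model acting as the visual vocabulary.
    """
    words = codebook.predict(descriptors)               # assign descriptors to visual words
    k = codebook.n_clusters
    h, w = image_shape
    histograms = []
    for cells in levels:                                 # 1x1, 2x2, and 4x4 grids
        cell_h, cell_w = h / cells, w / cells
        for i in range(cells):
            for j in range(cells):
                in_cell = ((positions[:, 0] // cell_h) == i) & \
                          ((positions[:, 1] // cell_w) == j)
                histograms.append(np.bincount(words[in_cell], minlength=k))
    hist = np.concatenate(histograms).astype(float)
    return hist / (hist.sum() + 1e-8)                    # L1-normalised pyramid histogram

# hypothetical usage: codebook = KMeans(n_clusters=200).fit(all_training_descriptors)
```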
Recently, deep-learning-based methods have been successfully utilized in scene classification and have proven promising at extracting high-level features. For instance, Shi et al. [15] proposed a multi-level feature fusion method based on a lightweight convolutional neural network to improve classification performance on scene images. Yuan et al. [16] proposed a multi-subset feature fusion method to integrate the global and local information of deep features. A dual-channel spectral feature extraction network was introduced in [17]; the model employs a 3D convolution kernel to directly extract multi-scale spatial features, and an adaptive fusion of spectral and spatial features is then performed to improve performance. These methods testify to the importance of deep-learning-based feature fusion. However, patch-based global feature learning has never been deeply investigated within the BoW framework. Moreover, the authors in [18] argued that convolutional layers record a precise position of features in the input and generate a fixed-dimensional representation in the CNN framework. Since the pooling process decreases the size of the feature matrix after the convolution layer, the performance of CNNs remains limited when key features are minute and irregular [19]. One of the reasons is that natural images are mostly captured by cameras with manual or auto-focus options, which makes them center-biased [20]. In the case of remote sensing scene classification, however, images are usually captured overhead. Therefore, using a CNN as a “black box” to classify remote sensing images may not be good enough for complex scenes. Even though several works [21,22] attempted to focus on critical local image patches, the role of the spatial dependency among objects in the remote sensing scene classification task remains an unsolved problem [23].
In general, patch sampling or feature learning is a critical component for building an intelligent system, whether based on CNN models or BoW approaches. Ideally, special attention should be paid to the image patches that are the most informative for the classification task, since objects can appear at any location in the image [24]. Recent studies address this issue by sampling feature points on a regular dense grid [25] or with a random strategy [26], because there is no clear consensus about which sampling strategy is most suitable for natural scene images. Although multiscale keypoint detectors (e.g., Harris-affine, Laplacian of Gaussian) as samplers [27] are well studied in the computer vision community, they were not designed to find the most informative patches for scene image classification [26]. In this paper, instead of working towards a new CNN model or a local descriptor, we introduce patch-based discriminative learning (PBDL) to extract image features region by region based on small, medium, and large neighborhood patches, thereby fully exploiting spatial structure information in the BoW model. Here, neighborhoods are considered small, medium, or large regions depending on the patch size: the smallest patch size represents the small region, an intermediate patch size represents the medium region, and the two largest patch sizes represent the large regions. This is motivated by the fact that different patch sizes still exhibit a good ability to capture spatial dependencies between image region features, which may help to interpret the scene [28,29].
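A minimal sketch of this multi-size patch extraction is given below; it assumes OpenCV's contrib build for SURF, and the patch sizes are illustrative placeholders rather than the values used in this paper:

```python
import cv2

def multi_size_surf_descriptors(gray_image, patch_sizes=(8, 16, 32)):
    """Compute SURF descriptors on dense grids of small, medium, and large patches.

    patch_sizes are placeholders; one keypoint is placed at the centre of every
    non-overlapping patch of each size, and SURF descriptors are computed there.
    """
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # requires opencv-contrib
    h, w = gray_image.shape
    descriptors_per_size = []
    for s in patch_sizes:
        keypoints = [cv2.KeyPoint(x + s / 2.0, y + s / 2.0, float(s))
                     for y in range(0, h - s + 1, s)
                     for x in range(0, w - s + 1, s)]
        _, descriptors = surf.compute(gray_image, keypoints)
        descriptors_per_size.append(descriptors)
    return descriptors_per_size
```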
Figure 2 illustrates the extracted regions used in our work. In Figure 2, from right to left, the images in the first column appear dark green because of the smallest patch size, and the images in the second column show the SURF features for the next patch size. Likewise, the two larger patch sizes are used for the features displayed in the third and fourth columns, respectively. Moreover, the proposed method also magnifies the visual information by utilizing Gaussian pyramids in a scale-space setting to improve the classification performance. In particular, the idea of magnifying the visual information in our work is based on generating a multi-scale representation of an image by creating a one-parameter family of derived signals [30]. Since the proposed multi-level learning is based on different image patch sizes, spatial receptive fields may overlap due to the unique nature of remote sensing scene images (e.g., buildings, fields, etc.). Thus, we also consider the sampling redundancy problem and minimize the influence of nearby or neighboring pixels. We show that overlapping pixels can be minimized by setting the pixel stride equal to the pixel width of the feature window.
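The sketch below illustrates, under assumed parameter values, how a Gaussian scale-space pyramid can be generated and how a sampling stride equal to the window width yields non-overlapping patches at every level:

```python
import numpy as np
from skimage.transform import pyramid_gaussian

def pyramid_patch_grids(gray_image, patch_size=16, levels=3, downscale=2):
    """Yield non-overlapping patch grids from each level of a Gaussian pyramid.

    Using a stride equal to the patch width keeps neighbouring feature windows
    from overlapping; patch_size, levels, and downscale are assumed values.
    """
    for level, smoothed in enumerate(pyramid_gaussian(gray_image,
                                                      max_layer=levels - 1,
                                                      downscale=downscale)):
        h, w = smoothed.shape[:2]
        stride = patch_size                              # stride == window width
        patches = [smoothed[y:y + patch_size, x:x + patch_size]
                   for y in range(0, h - patch_size + 1, stride)
                   for x in range(0, w - patch_size + 1, stride)]
        if patches:                                      # smallest levels may be too small
            yield level, np.stack(patches)
```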
Next, we balance the contribution of individual patch features with a simple fusion strategy, motivated by two considerations. First, the proposed fusion strategy can surpass previous performance without utilizing state-of-the-art fusion methods such as DCA [32], PCA [33], and CCA [34], as previously utilized in the remote sensing domain (we further discuss this aspect in Section 4.3). Second, it avoids the disadvantages of traditional dimensionality reduction techniques such as principal component analysis (PCA), namely their data-dependent characteristics, the computational burden of diagonalizing the covariance matrix, and the lack of a guarantee that distances in the original and projected spaces are well retained. Finally, the BiLSTM [35] network is adopted, after combining the small-, medium-, and large-scale spatial and visual histograms, to classify scene images (a minimal illustrative sketch of this fusion and classification step is given after the contribution list below). We demonstrate that the collaborative fusion of the different regions (patch sizes) addresses the problem of intra-class difference, and that the aggregated multi-scale features in scale-space pyramids can be used to alleviate the problem of inter-class similarity. To this end, our main contributions in this paper are summarized as follows:
1. We present a patch-based discriminative learning approach that combines all the surrounding features into a new single vector and addresses the problems of intra-class diversity and inter-class similarity.
2. We demonstrate the effectiveness of patch-based learning in the BoW model for the first time. Our method suggests that computing visual descriptors on image regions independently can be more effective than random sampling for remote sensing scene classification.
3. To enlarge the visual information, smoothing and stacking are performed by convolving the image with Gaussian second derivatives. In this way, we integrate the fixed regions (patches) into multiple downscaled versions of the input image in a scale-space pyramid and thereby explore more content and important information.
4. The proposed method not only surpasses previous BoW methods but also several state-of-the-art deep-learning-based methods on four publicly available datasets, achieving state-of-the-art results.
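As mentioned before the contribution list, a minimal PyTorch-style sketch of the fusion and BiLSTM classification step is shown below; the layer sizes, the number of classes, and the treatment of the small/medium/large histograms as a short sequence are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class HistogramBiLSTM(nn.Module):
    """Classify a short sequence of per-region BoW histograms with a BiLSTM.

    hist_dim, hidden_dim, and num_classes are illustrative placeholders.
    """
    def __init__(self, hist_dim=200, hidden_dim=128, num_classes=45):
        super().__init__()
        self.bilstm = nn.LSTM(hist_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, histograms):
        # histograms: (batch, num_regions, hist_dim), one vector per patch size/scale
        _, (h_n, _) = self.bilstm(histograms)
        fused = torch.cat([h_n[-2], h_n[-1]], dim=1)     # final forward + backward states
        return self.classifier(fused)

# hypothetical usage: logits = HistogramBiLSTM()(torch.randn(4, 3, 200))
```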
The rest of this work is organized as follows. Section 2 reviews the literature related to this study. Section 3 introduces the proposed PBDL for remote sensing scene classification. Section 4 presents the experimental results of the proposed PBDL on several public benchmark datasets. Section 5 summarizes the entire work and offers suggestions for future research.
2. Literature Review
In the early 1970s, most methods in remote sensing image analysis focused on per-pixel analysis, labeling each pixel in satellite images (such as the Landsat series) with a semantic class, because the spatial resolution of Landsat images acquired by the satellite sensors was very low and the size of a pixel was close to the size of the objects of interest [7]. With advances in remote sensing technology, the spatial resolution of remote sensing images has become increasingly finer than the typical object of interest, and objects are usually composed of many pixels, so that single pixels have lost their semantic meaning. In such cases, it is difficult, or at times inadequate, to recognize scene images at the pixel level alone. In 2001, Blaschke and Strobl [36] raised the critical question “What’s wrong with pixels?” and concluded that analyzing remote sensing images at the object level is more efficient than the statistical analysis of single pixels. Afterward, a new paradigm of object-level analysis of remote sensing images dominated for the last two decades [7].
However, pixel- and object-level classification methods may not always be sufficient for correct classification because pixel-based identification carries little semantic meaning. Under these circumstances, semantic-level remote sensing image scene classification seeks to assign each given remote sensing image patch to an explicit semantic class (e.g., commercial area, industrial area, or residential area). This has led to the categorization of remote sensing scene classification methods into three main classes according to the employed features: human engineering-based methods, unsupervised feature learning (or global-based) methods, and deep feature learning-based methods. Early works in scene classification required a considerable amount of engineering skill and were mainly based on handcrafted descriptors [8,10,37,38]. These methods mainly focused on texture, color histograms, shape, and spatial and spectral information, and were invariant to translation and rotation.
In brief, handcrafted features have their own benefits and disadvantages. For instance, color features are more convenient to extract than texture and shape features [38]. Indeed, color histograms and color moments provide discriminative features, and local descriptors such as local binary patterns (LBPs) [8], the scale-invariant feature transform (SIFT) [9], color histograms [10], and histograms of oriented gradients (HOG) [11] are commonly computed. Although color-based histograms are easy to compute, they do not convey spatial information, and the high resolution of scene images makes it very difficult to distinguish images with the same colors. Yu et al. [39] proposed a new descriptor called color-texture-structure (CTS) to encode color, texture, and structure features. In their work, a dense approach was used to build a hierarchical representation of the images; the co-occurrence patterns of regions were then extracted and the local descriptors were encoded to test the discriminative capability. Tokarczyk et al. [38] proposed using integral images to extract discriminative textures at different scale levels of scene images. The features, named Randomized Quasi-Exhaustive (RQE), are capable of covering a large range of texture frequencies. The main advantage of extracting such cues as color, texture, or spatial information is that they can be directly utilized by classifiers for scene classification. On the other hand, each individual cue captures only a single type of feature, so it remains challenging or inadequate for describing the content of the entire scene image. To overcome this limitation, Chen et al. [37] proposed a combination of different features, such as color, structure, and texture. To perform the classification task, k-nearest-neighbor (KNN) and support vector machine (SVM) classifiers were employed, and decision-level fusion was performed to improve the classification of scene images. Zhang et al. [40] focused on a variable selection process based on random forests to improve land cover classification.
To further improve on the robustness of handcrafted descriptors, the bag-of-words (BoW) framework has enabled significant progress in remote sensing image scene classification [41]. Learning global features, Khan et al. [42] investigated multiple hand-crafted color features in the bag-of-words model, using color and shape cues to enhance the performance of the model. Yang et al. [43] extended the BoW model with a spatial co-occurrence kernel, proposing two spatial extensions to emphasize the importance of spatial structure in geographic data. Vigo et al. [44] showed that incorporating color and shape in both feature detection and extraction significantly improves the bag-of-words image representation. Sande et al. [45] presented a detailed study of the invariance properties of color descriptors and concluded that adding color descriptors on top of SIFT increases the classification accuracy by 8%. Lazebnik et al. [14] proposed a spatially hierarchical pooling stage to form the spatial pyramid matching (SPM) method. To improve the SPM pooling stage, sparse codes (SC) of SIFT features were merged into the traditional SPM [46]. Although researchers have proposed several methods that achieve good performance for land use classification, especially compared to handcrafted feature-based methods, one major disadvantage of BoW is that it neglects the spatial relationships among patches; its performance thus remains limited, and the localization issue in particular is not well understood.
Recently, most state-of-the-art approaches have relied on end-to-end learning to obtain good feature representations. Specifically, convolutional neural networks (CNNs) are the state-of-the-art framework for scene image classification: convolutional layers convolve local image regions independently and pass their results to the next layer, whereas pooling layers reduce the dimensions of the data. Due to the wide range of image resolutions and the various scales of detailed textures, fixed-size kernels are inadequate for extracting scene features at different scales. Therefore, the focus of the current literature has shifted to multi-scale and fusion methods in the scene image classification domain, and existing deep learning methods make full use of multi-scale information and fusion for a better representation. For instance, Ghanbari et al. [47] proposed a multi-scale method called the dense-global-residual network to reduce the loss of spatial information and enhance the context information. The authors used a residual network to extract the features and a global spatial pyramid pooling module to obtain dense multi-scale features at different levels. Zuo et al. [48] proposed a convolutional recurrent neural network to learn the spatial dependencies between image regions and enhance the discriminative power of the image representation. The authors trained their model in an end-to-end manner, where CNN layers were used to generate mid-level features and an RNN was used to learn contextual dependencies. Huang et al. [49] proposed an end-to-end deep learning model with multi-scale feature fusion, channel-spatial attention, and a label correlation extraction module. Specifically, a channel-spatial attention mechanism was used to fuse and refine multi-scale features from different layers of the CNN model.
Li et al. [50] proposed an adaptive multi-layer feature fusion model to fuse different convolutional features with a feature selection operation rather than simple concatenation. The authors claimed that their proposed method is flexible and can be embedded into other neural architectures. Few-shot scene classification was addressed in [51] with an end-to-end network called the discriminative learning of adaptive match network (DLA-MatchNet). The authors tackled the issues of large intraclass variance and interclass similarity by introducing an attention mechanism into the feature learning process; in this way, discriminative regions are extracted, which helps the classification model emphasize valuable feature information. Xiwen et al. [52] proposed a unified annotation framework based on a stacked discriminative sparse autoencoder (SDSAE) and weakly supervised feature transferring. The results demonstrated the effectiveness of weakly supervised semantic annotation in remote sensing scene classification. Rosier et al. [53] found that fusing Earth observation and socioeconomic data leads to increases in the accuracy of urban land use classification.
As noted above, fixed-size CNN kernels are inadequate for extracting scene features at different scales, which has also motivated attention mechanisms alongside multi-scale and fusion methods in the scene image classification domain. The main idea behind the attention mechanism was initially developed in 2014 for natural language processing applications [54], based on the assumption that assigning different weights to different pieces of information can direct the attention of the model [24].
In our work, we pay particular attention to the previous work [33], in which the authors reported that a simple combination strategy achieves relatively low accuracy when fusing deep features (AlexNet, VGG-M, VGG-S, and CaffeNet). Thus, a natural question arises: can we combine different region features effectively and efficiently for scene image classification? With the exception of [32], to our knowledge, this question still remains mostly unanswered; in particular, deep features from fully connected layers were fused with the DCA method to improve scene image classification in [32]. We show that raw SURF features, combined by a simple concatenation of the different patch-size features, provide informative descriptions of scene images. Experimental results on four public remote sensing image datasets demonstrate that combining the proposed discriminative regions improves performance on the NWPU, AID, WHU-RS, and UC Merced datasets.
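For illustration, a minimal sketch of such a simple concatenation fusion is shown below; the array shapes are assumptions, and the linear classifier is only a stand-in for the BiLSTM used in our method:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def concatenate_patch_features(histograms_per_size):
    """Fuse per-patch-size histogram matrices by L2-normalising and concatenating.

    Each element of histograms_per_size is assumed to be an
    (n_images, codebook_size) array computed for one patch size.
    """
    return np.concatenate([normalize(h, norm="l2") for h in histograms_per_size], axis=1)

# hypothetical usage with a stand-in linear classifier instead of the BiLSTM:
# fused = concatenate_patch_features([small_hists, medium_hists, large_hists])
# clf = LinearSVC().fit(fused[train_idx], labels[train_idx])
```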