With the development of Earth observation technology, many different types of high-resolution images of the Earth’s surface (e.g., multi/hyperspectral [1] and synthetic aperture radar [2]) are readily available. Therefore, it is particularly important to understand their semantic content effectively, and more intelligent identification and classification methods for land use and land cover (LULC) are in strong demand. Remote sensing image scene classification, which aims to automatically assign a specific semantic label to each remote sensing image scene patch according to its contents, has become an active research topic in the field of remote sensing image interpretation because of its vital applications in LULC, urban planning, land resource management, disaster monitoring, and traffic control [3].
During the last decades, several methods have been developed for remote sensing image scene classification. The early methods were mainly based on low-level or hand-crafted features, which focus on designing various human-engineered features locally or globally, such as color, texture, shape, and spatial information. Representative features, including the scale-invariant feature transform (SIFT), color histogram (CH), local binary pattern (LBP), Gabor filters, grey level co-occurrence matrix (GLCM), and the histogram of oriented gradients (HOG), or their combinations, are commonly used for scene classification [7]. It is worth noting that methods relying on these low-level features perform well on images with uniform texture or spatial arrangements, but they remain limited for distinguishing more challenging and complex scenes, because human involvement in feature design significantly constrains the representation capacity for scene images. In contrast to low-level feature-based methods, mid-level feature approaches attempt to compute a holistic image representation formed from local visual features, such as SIFT, color histograms, or LBP, of local image patches. The general pipeline for building mid-level features is to first extract local attributes of image patches and then encode them to obtain the mid-level representation of remote sensing images. The well-known bag-of-visual-words (BoVW) model is the most popular mid-level approach and has been widely adopted for remote sensing image scene classification because of its simplicity and effectiveness [13]. Methods based on the BoVW model have improved classification performance, but owing to the limited representation capability of the BoVW model, no further breakthroughs have been achieved for remote sensing image scene classification.
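The general BoVW pipeline described above (extract local descriptors, quantize them against a learned visual vocabulary, then histogram the resulting word assignments) can be sketched as follows. This is an illustrative toy in plain NumPy: the descriptor dimensionality, vocabulary size, and random stand-in data are assumptions for demonstration, not values used in this paper.

```python
import numpy as np

def build_codebook(descriptors, k=8, iters=20, seed=0):
    """Toy k-means over local descriptors to form a visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest visual word.
        dist = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bovw_histogram(descriptors, codebook):
    """Encode an image's local descriptors as a normalized visual-word histogram."""
    dist = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Hypothetical data: random 16-D "SIFT-like" descriptors standing in for
# local features extracted from image patches.
rng = np.random.default_rng(1)
codebook = build_codebook(rng.normal(size=(200, 16)), k=8)
vec = bovw_histogram(rng.normal(size=(50, 16)), codebook)
print(vec.shape, round(vec.sum(), 6))  # (8,) 1.0
```

The resulting histogram is the holistic mid-level representation that a classifier such as an SVM would consume.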
Recently, with the prevalence of deep learning methods, which have achieved impressive performance in many applications including image classification [19], object recognition [20], and semantic segmentation [21], the feature representation of images has stepped into a new era. Unlike low-level and mid-level features, deep learning models can learn more powerful, abstract, and discriminative features via deep-architecture neural networks without a considerable amount of engineering skill and domain expertise. These deep learning models, especially the convolutional neural network (CNN), are well suited to remote sensing image scene classification and have achieved state-of-the-art results [22]. Although CNN-based methods have dramatically improved classification accuracy, some scene classes are still easily misclassified. Taking the AID dataset as an example, the class-specific classification accuracy of ‘school’ is only 49% [35], as it is usually confused with ‘dense residential’. As shown in Figure 1, two images labelled ‘school’ and two images labelled ‘dense residential’ have been selected from the AID dataset. The four images have similar image content and all contain many buildings and trees. However, unlike the irregularly arranged buildings in ‘school’, the buildings in ‘dense residential’ are arranged closely and orderly. This difference in spatial layout is very helpful in distinguishing the two classes and should be given more consideration in the classification phase. However, the fully connected layer at the end of the CNN model compresses the two-dimensional feature maps into a one-dimensional feature vector and cannot fully consider spatial relationships, which makes it difficult to distinguish the two classes.
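A toy numerical illustration (not from the paper) of why collapsing a two-dimensional feature map into an order-insensitive one-dimensional summary hides spatial layout: two maps holding the same activations in different arrangements become indistinguishable after global pooling, much like 'school' and 'dense residential' scenes that contain the same objects in different layouts.

```python
import numpy as np

# Two toy 4x4 single-channel activation maps: the same responses, arranged
# regularly (loosely, 'dense residential') vs. irregularly ('school').
regular = np.zeros((4, 4))
regular[::2, ::2] = 1.0                        # orderly grid of activations

scattered = np.zeros((4, 4))
scattered[[0, 1, 3, 2], [1, 3, 0, 2]] = 1.0    # same count, irregular layout

# An order-insensitive summary (e.g., global average pooling) yields
# identical values, so the spatial difference becomes invisible.
print(regular.mean() == scattered.mean())      # True: summaries collide
print(np.array_equal(regular, scattered))      # False: layouts differ
```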
Recently, the advent of the capsule network (CapsNet) [36], a novel architecture that encodes the properties and spatial relationships of the features in an image and serves as a more effective image recognition algorithm, has shown encouraging results on image classification. Although the CapsNet is still in its infancy [37], it has been successfully applied in many fields [38] in recent years, such as brain tumor classification, sound event detection, object segmentation, and hyperspectral image classification. The CapsNet uses a group of neurons, called a capsule, to replace a single neuron in the traditional neural network. A capsule is a vector representing internal properties that can be used to learn part–whole relationships between various entities, such as objects or object parts, to achieve equivariance [36]; this addresses the problem that traditional neural networks with fully connected layers cannot efficiently capture the hierarchical structure of the entities in images and thus fail to preserve spatial information [50].
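The two capsule mechanics referenced here, following the formulation in [36], are a squashing nonlinearity that maps a capsule vector's length into [0, 1) so it can act as an existence probability, and routing-by-agreement between capsule layers. The sketch below uses illustrative capsule counts, dimensions, and random prediction vectors; it is a minimal NumPy rendering, not the exact implementation used in this paper.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """CapsNet squashing: preserves direction, maps length into [0, 1)."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement over prediction vectors.
    u_hat: (num_in, num_out, dim_out) predictions from lower-level capsules."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                           # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum
        v = squash(s)                                         # output capsules
        b = b + (u_hat * v[None]).sum(axis=-1)                # agreement update
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(6, 3, 8)))  # 6 input caps -> 3 output caps
print(v.shape)                                   # (3, 8)
print(np.all(np.linalg.norm(v, axis=-1) < 1.0))  # True: lengths act as probabilities
```

Capsules whose predictions agree with an output capsule's direction receive larger coupling coefficients on the next iteration, which is how part–whole relationships are reinforced.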
To further improve the accuracy of remote sensing image scene classification, and motivated by the powerful feature-learning ability of deep CNNs and the equivariance property of the CapsNet, a new architecture named CNN-CapsNet is proposed in this paper to deal with the task of remote sensing image scene classification. The proposed architecture is composed of two parts. First, a deep CNN such as VGG-16 [51], pretrained on the ImageNet dataset [52], is used with its intermediate convolutional layers serving as an initial feature map extractor. Then, the initial feature maps are fed into a newly designed CapsNet to label the remote sensing image scenes. Experimental results on three challenging benchmark datasets show that the proposed architecture achieves more competitive accuracy compared with state-of-the-art methods. In summary, the major contributions of this paper are as follows:
To further improve classification accuracy, especially for classes with high homogeneity in image content, a novel architecture named CNN-CapsNet is proposed to deal with the remote sensing image scene classification problem, which can discriminate scene classes effectively.
By combining the CNN and the CapsNet, the proposed method obtains superior results compared with state-of-the-art methods on three challenging datasets without any data-augmentation operation.
This paper also analyzes the influence of different factors in the proposed architecture on the classification result, including the routing number in the training phase, the dimension of the capsules in the CapsNet, and the choice of pretrained CNN model, which can provide valuable guidance for subsequent research on remote sensing image scene classification using the CapsNet.
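The two-stage data flow outlined above (pretrained CNN feature maps regrouped into capsules) can be sketched roughly as follows. The 14×14×512 feature-map shape (a typical intermediate VGG-16 output for a 224×224 input) and the capsule dimension of 8 are illustrative assumptions, not necessarily the paper's exact configuration; random values stand in for the CNN activations.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Map each capsule vector's length into [0, 1), preserving direction."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

# Stage 1 (assumed shapes): an intermediate VGG-16 convolutional layer maps a
# 224x224x3 scene patch to 14x14x512 feature maps; random stand-in values here.
rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(14, 14, 512))

# Stage 2: regroup the activations into primary capsules (dimension 8 chosen
# here as an illustrative hyperparameter) and squash them so their lengths
# behave as existence probabilities before routing to the class capsules.
caps_dim = 8
primary_caps = squash(feature_maps.reshape(-1, caps_dim))
print(primary_caps.shape)  # (12544, 8): 14*14*512 / 8 capsules
```

The final class capsules would then be produced by routing-by-agreement from these primary capsules, with the longest output capsule giving the predicted scene label.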
The remainder of this paper is organized as follows. In Section 2, the materials are illustrated. Section 3 first introduces the theory of the CNN and the CapsNet and then describes the proposed method in detail. Section 4 analyzes the influence of different factors and discusses the experimental results of the proposed method. Finally, conclusions are drawn in Section 5.
In recent years, the prevalence of deep learning methods, especially the CNN, has brought state-of-the-art performance to remote sensing scene classification. However, scene classes with similar image content are still not distinguished effectively. This is mainly because fully connected layers are added at the end of the CNN, which give little consideration to the spatial relationships that are vital for classification. To preserve spatial information, the CapsNet architecture was proposed, which uses capsules to replace the neurons of the traditional neural network. A capsule is a vector representing internal properties that can be used to learn part–whole relationships within an image. In this paper, to further improve the classification accuracy of remote sensing image scene classification and inspired by the CapsNet, a novel architecture named CNN-CapsNet is proposed for remote sensing image scene classification. The proposed architecture consists of two parts: a CNN and a CapsNet. The CNN part transforms the original remote sensing images into initial feature maps, and the CapsNet part converts the initial feature maps into various levels of capsules to obtain the final classification result. Experiments were performed on three public challenging datasets, and the results demonstrate the effectiveness of the proposed CNN-CapsNet and show that it outperforms current state-of-the-art methods. In future work, rather than using feature maps from only one CNN model, feature maps from different pretrained CNN models will be merged for remote sensing image scene classification.