1. Introduction
Classification is a fundamental task in remote sensing imagery analysis. Applying intelligent methods, such as pattern recognition and statistical learning, is an effective way to obtain class information of ground objects, and it has long been a main focus of both research and commercial development. Early classification was mainly performed at the pixel level on low spatial resolution (10–30 m) images, using unsupervised classification (also known as clustering, such as K-means [1]) or supervised classification (such as Neural Networks [2,3] and Support Vector Machines [4,5]). These methods often use only the spectral information of the images; they have become standard modules in commercial software and have been successfully applied in land resources, environment, agriculture, and other fields. In recent years, new approaches have appeared that are clearly superior to the traditional ones. For example, Yuan Yuan et al. [6] and Qi Wang et al. [7] applied recent achievements in machine learning, such as Manifold Ranking and Sparse Representation, to hyperspectral image classification.
High resolution (2 m spatial resolution and higher) remote sensing images contain more ground details. Many applications aim to obtain attributes of a ground object (such as a single building) rather than of individual pixels. However, pixel-level classification methods are sensitive to noise, lack the semantic meaning of objects, and make it difficult to obtain object-level information. Therefore, object-oriented classification [8] was proposed, and it has achieved great success in high resolution image classification. At present, eCognition [9], ENVI [10], and other commercial software packages provide object-oriented classification modules. Most object-oriented approaches follow a “segmentation-classification” scheme. In the segmentation stage, Multi-Resolution (MR) [11], Full-Lambda Schedule (FLS) [12], Mean-Shift [13], Quadtree-Seg [14], and other image segmentation approaches are used to generate image segments, which we call image objects. In the classification stage, object features (color, texture, and geometric features) are computed and fed into a supervised or unsupervised classifier, or into a manually designed rule set for feature filtering, to achieve the final class discrimination.
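As an illustrative sketch only (not any of the commercial implementations cited above), the following NumPy snippet shows the “segmentation-classification” idea: given a precomputed segment map, it derives a simple per-object feature (mean colour) and assigns each object to the nearest class centroid. The feature, the class names, and the nearest-centroid rule are hypothetical stand-ins for the richer features and SVM/rule-set classifiers used in practice.

```python
import numpy as np

def object_features(image, segments):
    """Compute a simple feature vector (mean colour per band) for each
    image object produced by a prior segmentation step."""
    feats = {}
    for sid in np.unique(segments).tolist():
        mask = segments == sid
        feats[sid] = image[mask].mean(axis=0)  # mean over the object's pixels
    return feats

def classify_objects(feats, centroids):
    """Assign each object the label of the nearest class centroid
    (a stand-in for an SVM or rule-set classifier)."""
    labels = {}
    for sid, f in feats.items():
        dists = {c: np.linalg.norm(f - mu) for c, mu in centroids.items()}
        labels[sid] = min(dists, key=dists.get)
    return labels

# Toy example: a 4x4 RGB image pre-segmented into two objects.
image = np.zeros((4, 4, 3))
image[:, :2] = [0.9, 0.1, 0.1]   # reddish object
image[:, 2:] = [0.1, 0.9, 0.1]   # greenish object
segments = np.zeros((4, 4), dtype=int)
segments[:, 2:] = 1

feats = object_features(image, segments)
centroids = {"roof": np.array([0.9, 0.1, 0.1]),
             "grass": np.array([0.1, 0.9, 0.1])}
print(classify_objects(feats, centroids))  # {0: 'roof', 1: 'grass'}
```

The key point is that the classifier operates on whole objects rather than pixels, which is what gives object-oriented approaches their robustness to pixel-level noise.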
Land cover comprises many types and is affected by noise, illumination, season, and many other factors, which brings great difficulties to classification using high resolution images. Even with object-oriented approaches, accurate classification remains very difficult. From the pattern recognition perspective, the selection/extraction of representative features is the bottleneck to improving accuracy: no single manually designed feature set can discriminate all kinds of ground objects. Therefore, learning features automatically from a remote sensing data set rather than using manually designed features, and then performing classification on the learned features, is an effective way to improve classification accuracy.
Deep learning theory was explicitly proposed by Hinton et al. [15] in 2006. It is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data [16]. The basic motivation of deep learning is to build a deep neural network that simulates the learning and analysis mechanisms of the human brain. Compared with traditional machine learning theories, the most significant difference of deep learning is its emphasis on automatic feature learning from huge data sets through the organization of multi-layer neurons. In recent years, various deep learning architectures such as Deep Belief Networks (DBN) [17], Convolutional Neural Networks (CNN) [18], and Recurrent Neural Networks (RNN) [19] have been applied to fields like computer vision [20,21], speech recognition, natural language processing, audio recognition, and bioinformatics, and they have been shown to produce state-of-the-art results in these domains.
Among deep learning techniques, CNNs have achieved remarkable results in image classification, recognition, and other vision tasks, and hold top scores on many visual benchmarks such as ImageNet, Pattern Analysis, Statistical Modeling and Computational Learning Visual Object Classes (PASCAL VOC), and Microsoft Common Objects in Context (MS-COCO). For image classification, the basic structure of the standard CNN is a stack of “convolutional-pooling” layers acting as multi-scale feature extractors, followed by a number of fully connected layers acting as a classifier. Many works on CNN-based remote sensing image analysis have emerged in recent years. Nguyen et al. [22] presented an approach for satellite image classification using a five-layered network and achieved classification accuracy higher than 75%. Wang et al. [23] used a three-layer CNN together with a Finite State Machine (FSM) to extract road networks for long-term path planning. Marco Castelluccio et al. [24] explored the use of CNNs for the semantic classification of remote sensing scenes. Similarly, Hu et al. [25] classified scenes from high resolution remote sensing imagery using a pre-trained CNN model. Weixun Zhou et al. [26] employed a CNN architecture as a deep feature extractor for high-resolution remote sensing image retrieval (HRRSIR). Volodymyr Mnih [27] proposed a CNN-based architecture that learns large-scale contextual features for aerial image labeling; the model produces a dense classification patch instead of a single image-level category. Martin Lagkvist et al. [28] presented a novel remote sensing imagery classification method based on CNNs for five classes (vegetation, ground, road, building, and water), outperforming existing classification approaches. Besides the CNN family, Yuan Yuan et al. [6] used a Stacked AutoEncoder classifier for classification after Manifold Ranking based salient band selection.
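The “convolutional-pooling” feature-extraction stages described above can be sketched in plain NumPy. This is a minimal single-channel illustration with one hand-set kernel; in a real CNN the kernels are learned and each layer has many channels.

```python
import numpy as np

def conv2d(x, kernel):
    """'Valid' 2-D convolution (implemented as cross-correlation, as in
    most deep learning libraries) over a single-channel feature map."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling, reducing each spatial dimension."""
    h, w = x.shape
    h, w = h // size * size, w // size * size  # crop to a multiple of size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Two "convolution-pooling" stages: each stage extracts local features and
# reduces resolution, so deeper layers see a larger spatial context.
x = np.random.rand(16, 16)
edge = np.array([[1.0, -1.0], [1.0, -1.0]])  # hand-set 2x2 kernel

f1 = max_pool(conv2d(x, edge))   # 16x16 -> conv 15x15 -> pool 7x7
f2 = max_pool(conv2d(f1, edge))  # 7x7  -> conv 6x6  -> pool 3x3
print(f1.shape, f2.shape)        # (7, 7) (3, 3)
```

The shrinking spatial size at each stage is exactly the down-sampling behaviour that the FCN-based methods discussed below must undo to recover a dense, full-resolution class map.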
The standard CNN operates in an “image-label” manner: its output is a probability distribution over classes. However, most remote sensing image classification tasks expect a dense class map as the output, with the same spatial dimensions as the original image. A class map is a 2-D distribution of class labels with pixel correspondence, i.e., a “pixel-label” mode. In the study of Martin Lagkvist et al. [28], “per-pixel” classification is achieved using overlapped patches and average post-processing. However, the overlapped patches introduce much redundant computation, and the averaging can easily lose useful edge information. Building on the standard CNN, Jonathan Long et al. [29] proposed the Fully Convolutional Network (FCN) model in 2015. By replacing the fully connected (FC) layers of the standard CNN with convolutional layers, the FCN model maintains the 2-D structure of images, and it was the first to carry out CNN-based image semantic segmentation. To obtain a dense class map, Liang-Chieh Chen et al. [30] used “atrous” convolution instead of ordinary convolution, increasing the density of the predicted class labels, and then applied Conditional Random Fields (CRFs) as post-processing to refine region boundaries. CRFs-based boundary refinement is also used in the work of Sakrapee et al. [31]. To integrate the CRFs procedure into the training stage, Shuai Zheng et al. [32] applied the idea of RNNs to image segmentation, implementing an “end-to-end” training procedure. In the remote sensing community, several studies employ FCN-based approaches for dense class map generation. Jamie Sherrah [33] analyzed the down-sampling and up-sampling mechanisms in CNNs and adopted an FCN architecture for aerial image semantic labelling, removing the down-sampling of the standard FCN by introducing deconvolution. D. Marmanis et al. [34] also used an FCN with a subsequent deconvolution architecture to perform semantic segmentation of aerial images. Emmanuel Maggiori et al. [35,36,37] addressed the dense classification problem and compared patch-based CNN dense classification with FCN. Exploiting the advantages of FCN, the authors proposed an end-to-end framework for large-scale remote sensing classification, and also considered a multi-scale mechanism by designing a specific neuron module that processes its input at multiple scales.
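To illustrate why atrous convolution helps densify predictions, the following is a minimal 1-D NumPy sketch (our own simplification, not the implementation of [30]): the same kernel is applied with its taps spaced `rate` samples apart, enlarging the receptive field without down-sampling the signal.

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D 'atrous' (dilated) convolution: kernel taps are spaced `rate`
    samples apart, so the receptive field grows with `rate` while the
    output stays at the input's sampling density (no pooling needed)."""
    k = len(kernel)
    span = (k - 1) * rate + 1            # effective receptive field size
    out = np.zeros(len(x) - span + 1)    # 'valid' output length
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)           # toy signal 0,1,...,9
kernel = np.array([1.0, 0.0, -1.0])      # simple difference kernel

dense = atrous_conv1d(x, kernel, rate=1)   # ordinary convolution
atrous = atrous_conv1d(x, kernel, rate=2)  # same kernel, doubled field of view
print(len(dense), len(atrous))  # 8 6
print(dense[0], atrous[0])      # -2.0 -4.0 (taps 2 vs. 4 samples apart)
```

In 2-D networks the same trick lets later layers keep full spatial resolution while still aggregating wide context, which is why [30] can predict denser class labels than a pooled CNN.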
In this paper, we perform an FCN-based classification of high spatial resolution remote sensing imagery with 12 classes (bare land, grass, tree, water, building, cement ground, parking lot, playground, city road, trail, shadow, and others). These classes are typical ground objects in urban areas, and some of them (such as building, cement ground, road, and parking lot) are easily confused in traditional classification tasks. The class configuration was arranged to test the effectiveness of our approach in a complex environment. We fine-tuned the parameters of the ImageNet-pretrained VGG-16 [37] network using GF-2 satellite images, to adapt it to our remote sensing imagery classification task. The VGG network has a compact structure of convolutional and pooling layers and achieved one of the top classification accuracies in ImageNet ILSVRC-2014. To suppress the noise caused by pixel-level classification, we refine the region boundaries using fully connected CRFs, following the procedures of Liang-Chieh Chen et al. [30] and Sakrapee et al. [31]. The refined output is more readily applied to object-oriented analysis.
We compare our approach with an object-oriented approach using MR segmentation [11] and SVM classification, the patch-based CNN classification proposed in [27], and the FCN-8s approach proposed in [29], all of which have been successful in high resolution imagery classification or natural image segmentation. The results show that our approach achieves higher classification accuracy. For those objects that are difficult to classify, our approach has lower confusion rates.