Combining Triple-Part Features of Convolutional Neural Networks for Scene Classification in Remote Sensing

Abstract: High spatial resolution remote sensing (HSRRS) images contain complex geometrical structures and spatial patterns, and thus HSRRS scene classification has become a significant challenge in the remote sensing community. In recent years, convolutional neural network (CNN)-based methods have attracted tremendous attention and obtained excellent performance in scene classification. However, traditional CNN-based methods focus on processing original red-green-blue (RGB) image-based features or CNN-based single-layer features to achieve the scene representation, and ignore that texture images and the individual layers of CNNs also contain discriminating information. To address the above-mentioned drawbacks, a CaffeNet-based method termed CTFCNN is proposed in this paper to effectively explore the discriminating ability of a pre-trained CNN. First, the pre-trained CNN model is employed as a feature extractor to obtain convolutional features from multiple layers, fully connected (FC) features, and local binary pattern (LBP)-based FC features. Then, a new improved bag-of-view-word (iBoVW) coding method is developed to represent the discriminating information from each convolutional layer. Finally, weighted concatenation is employed to combine the different features for classification. Experiments on the UC-Merced dataset and the Aerial Image Dataset (AID) demonstrate that the proposed CTFCNN method performs significantly better than some state-of-the-art methods, with overall accuracies reaching 98.44% and 94.91%, respectively. This indicates that the proposed framework can provide a discriminating description for HSRRS images.

In scene classification, the extraction of scene-level discriminative features is the key step in bridging the huge gap between an original image and its semantic category. In recent years, researchers have proposed various feature extraction methods, which can be divided into three main types: low-level methods, mid-level methods, and high-level methods [21,22]. Traditional scene classification methods were developed directly on low-level features, such as texture and color features.

Convolutional Neural Network (CNN)
CNN is one of the most popular deep learning methods, and its main advantage is that original images can be directly input into the network without complex pre-processing [57,58]. Typical CNNs are generally composed of convolutional layers, pooling layers, activation layers, fully connected layers, and a softmax layer. Figure 1 shows the architecture of CaffeNet, which is a typical CNN model [59]. As can be seen from Figure 1, for an input HSRRS image, convolution computations are first performed using convolution kernels with weight sharing, yielding feature maps. Then, a nonlinear activation function, such as the rectified linear unit (ReLU) or sigmoid, is introduced to enhance the expressive ability of the network. After that, mean pooling or max pooling is employed to reduce the number of parameters of the network and improve translation invariance. As the depth increases, the feature maps become highly abstract, and the feature maps from the last convolutional layer are flattened into a feature vector. This feature vector is further processed by the fully connected layers to form the global feature of a scene image, which is fed into a softmax layer to obtain the probability of each class.
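For illustration, the following is a minimal Python sketch of using a pre-trained network as a fixed feature extractor. It uses torchvision's AlexNet as a stand-in for CaffeNet (an assumption; the original experiments used MATLAB with MatConvNet), and "scene.jpg" is a placeholder file name.

```python
# Minimal sketch: a pre-trained CNN as a fixed feature extractor.
# AlexNet stands in for CaffeNet (assumption); "scene.jpg" is a placeholder.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()  # inference mode: dropout layers act as identity

preprocess = T.Compose([
    T.Resize((227, 227)),  # fixed input size, as in the paper
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    conv_maps = model.features(img)             # feature maps of the last conv block
    flat = model.avgpool(conv_maps).flatten(1)  # flatten into a feature vector
    fc1 = model.classifier[1](flat)             # first FC layer response (4096-D)
```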

Bag-of-View-Word (BoVW)
The BoVW model was originally proposed for natural language processing (NLP) and information retrieval (IR). Under this model, an image can be represented as a combination of many visual words. BoVW-based methods are widely applied in the computer vision field for their simplicity and efficiency [60,61]. The main processes are summarized as follows: (1) Local image patch sampling and feature extraction. For an input image, local image patches are obtained by dense sampling or sparse sampling. Then, local descriptors are extracted for each sampled image patch.
(2) Constructing a dictionary (codebook) that consists of many visual words. K-means clustering is usually employed to learn a set of clustering centers from local features. Each clustering center can be regarded as a visual word, and then all visual words constitute a visual dictionary.
(3) Feature encoding. Local features are mapped into dictionary space by a feature encoding method, and encoding vectors can be generated. The dimension of encoding vectors is the number of visual words. Feature coding methods include vector quantization (VQ), sparse coding (SC), and so on.
(4) Feature pooling. The global representation of an image can be formed by gathering statistics of the encoding vectors. The most frequently used methods are mean pooling and max pooling. A minimal sketch of the whole pipeline is given below.
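The following is a hedged Python sketch of the four steps above, with raw pixel patches standing in for the local descriptors (an assumption made for self-containedness):

```python
# Hedged BoVW sketch: raw pixel patches stand in for local descriptors.
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(img, size=8, stride=8):
    """(1) Dense sampling: each flattened patch is one local descriptor."""
    h, w = img.shape[:2]
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(0, h - size + 1, stride)
                     for j in range(0, w - size + 1, stride)], dtype=np.float64)

def build_dictionary(all_descriptors, n_words=100):
    """(2) K-means clustering: each cluster center is one visual word."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

def bovw_feature(descriptors, kmeans):
    """(3) VQ encoding + (4) pooling into a normalized histogram."""
    words = kmeans.predict(descriptors)  # hard assignment to the nearest word
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)   # global representation of the image
```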

Local Binary Pattern (LBP) Descriptor
As a typical texture descriptor, LBP [62] is widely employed in many tasks, such as face recognition [63], image classification [64], and object detection [65]. In HSRRS scene classification, texture-coded mapped images can be explored as inputs of deep networks to provide useful supplementary information. Figure 2 shows the principle of the LBP descriptor, which aims to capture the local gray-scale distribution of an image by comparing the pixel values of a center pixel and its neighboring pixels. As shown in Figure 2, in a 3 × 3 spatial window, the LBP descriptor takes the gray value of the central pixel $g_0$ as the threshold and encodes each of the eight surrounding pixels $g_i$ ($i = 1, 2, \ldots, 8$) by comparison with $g_0$. If $g_i$ is larger than $g_0$, the code of pixel $g_i$ is assigned as "1" (binary number); otherwise, it is assigned as "0". This process can be defined as
$$ s(g_i - g_0) = \begin{cases} 1, & g_i > g_0 \\ 0, & \text{otherwise.} \end{cases} $$
After the LBP operation, the code $g_c \in [0, 255]$ of the central pixel can be obtained by connecting the binary codes clockwise, and it can be calculated as follows:
$$ g_c = \sum_{i=1}^{8} s(g_i - g_0) \cdot 2^{\,i-1}. $$
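A minimal NumPy sketch of this 3 × 3 operator follows; the starting neighbor and the clockwise ordering are assumptions, since the paper does not fix them:

```python
# Minimal NumPy sketch of the 3x3 LBP operator described above.
import numpy as np

def lbp_3x3(gray):
    """Return the 8-bit LBP code map for the interior pixels of a gray image."""
    g0 = gray[1:-1, 1:-1].astype(np.int32)        # central pixels g_0
    h, w = gray.shape
    # Eight neighbors g_1..g_8, taken clockwise from the top-left corner (assumption).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(g0)
    for bit, (di, dj) in enumerate(offsets):
        gi = gray[1 + di:h - 1 + di, 1 + dj:w - 1 + dj].astype(np.int32)
        code |= (gi > g0).astype(np.int32) << bit  # s(g_i - g_0) * 2^(i-1)
    return code.astype(np.uint8)                   # g_c in [0, 255]
```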

Proposed Framework
To extract the discriminating information in HSRRS images, a CaffeNet-based framework termed CTFCNN is proposed to improve classification performance. The CTFCNN method extracts three types of features by using an off-the-shelf pre-trained CaffeNet model: multilayer convolutional features, features from the fully connected layers, and LBP-based FC features. Furthermore, dimensionality reduction and feature fusion are employed to achieve effective prediction of the semantic category of a scene. The flowchart of the proposed CTFCNN framework is shown in Figure 3.

Convolutional Features
In CNNs, the feature map of each convolutional layer contains different discriminating information. To fully utilize the convolutional features of the pre-trained CNN model, a new coding method termed iBoVW is proposed to generate the features from each convolutional layer. Compared with the traditional BoVW method, iBoVW tries to achieve a more reasonable representation of a scene by fusing manifold learning and nonlinear coding. The detailed process is shown in Figure 4.
As in Figure 4, HSRRS images can be transformed into coding vectors through the iBoVW encoding operation. In detail, the iBoVW method includes offline parameter learning and feature extraction stages.
In the stage of offline parameter learning, unlabeled images are randomly chosen and fed into the pre-trained CNN model, from which the FC layers have been removed. For each input image, the feature maps of the $l$-th convolutional layer can be obtained and regarded as $\omega \times \omega$ $N$-dimensional local features $X = \{x_1, x_2, \ldots, x_{\omega^2}\}$, $x_i \in \mathbb{R}^{N}$. Then, the projection matrix $V$ and the low-dimensional embedding $Y = \{y_1, y_2, \ldots, y_{\omega^2}\}$, $y_i \in \mathbb{R}^{n}$, are obtained by using locality preserving projection (LPP), which aims to minimize the following objective function:
$$ \min_{V} \sum_{i,j} \left\| V^{\mathsf{T}} x_i - V^{\mathsf{T}} x_j \right\|^2 w_{ij} = \min_{V} \operatorname{tr}\!\left( V^{\mathsf{T}} X L X^{\mathsf{T}} V \right), \quad \text{s.t. } V^{\mathsf{T}} X D X^{\mathsf{T}} V = I, $$
where $D$ is the diagonal weight matrix with $d_{ii} = \sum_j w_{ij}$, $L = D - W$ is the Laplacian matrix, and $W$ is the weight matrix. Then, K-means clustering is performed on the low-dimensional embedding $Y$ to learn the dictionary $Dic$.
In the feature extraction stage, a given image is input into the CNN to obtain the feature maps of the $l$-th convolutional layer, and the projection matrix $V$ is employed to reduce the dimension of the local descriptors. Then, non-linear coding is applied to the low-dimensional embedding $Y$ to get the coding features $\varphi$, as
$$ \varphi_i(y) = \begin{cases} \dfrac{\max\limits_{j \in N_k(y)} d_j \; - \; d_i}{\max\limits_{j \in N_k(y)} d_j \; - \; \min\limits_{j \in N_k(y)} d_j}, & Dic_i \in N_k(y) \\ 0, & \text{otherwise,} \end{cases} \qquad i = 1, 2, \ldots, N_V, $$
in which $y$ is the input vector, $N_V$ denotes the size of the visual vocabulary in the dictionary, $d_i = \|y - Dic_i\|_2$ denotes the Euclidean distance between $y$ and $Dic_i$, $N_k(\cdot)$ represents the $k$-nearest-neighbors space, and $k$ represents the number of associations between local descriptors and the visual vocabulary; $\max(\cdot)$ and $\min(\cdot)$ calculate the maximum and minimum values, respectively. Mean pooling is used to process all coding features of a scene, and a deep global feature from each convolutional layer can be obtained. Then, partial features are selected and fused by weighted concatenation as
$$ F_{\mathrm{conv}} = \left[\, p_1 \cdot \mathrm{conv}^{(1)}, \; p_2 \cdot \mathrm{conv}^{(2)}, \; \ldots \,\right], $$
where $p_l$ denotes the coefficient weight of each feature, and $\mathrm{conv}^{(l)}$ is the feature of the $l$-th convolutional layer.
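For illustration, the following is a hedged Python sketch of the iBoVW pipeline; the adjacency construction in LPP and the exact coding formula are assumptions reconstructed from the description above, not taken from released code:

```python
# Hedged iBoVW sketch: LPP projection, K-means dictionary, k-NN min-max
# soft coding, and mean pooling.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def lpp(X, n_components, n_neighbors=5):
    """Learn the LPP projection matrix V from local features X (rows are samples)."""
    W = kneighbors_graph(X, n_neighbors, mode='connectivity').toarray()
    W = np.maximum(W, W.T)                    # symmetric 0/1 weight matrix (assumption)
    D = np.diag(W.sum(axis=1))                # d_ii = sum_j w_ij
    L = D - W                                 # graph Laplacian
    A, B = X.T @ L @ X, X.T @ D @ X
    _, vecs = eigh(A, B + 1e-6 * np.eye(B.shape[0]))  # generalized eigenproblem
    return vecs[:, :n_components]             # smallest eigenvectors minimize the objective

def learn_dictionary(Y_all, n_words=2400):
    """K-means on the low-dimensional embeddings; the centers form the dictionary Dic."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(Y_all).cluster_centers_

def ibovw_feature(Y, dic, k=5):
    """Min-max soft coding of each embedded descriptor, then mean pooling."""
    codes = np.zeros((Y.shape[0], dic.shape[0]))
    for t, y in enumerate(Y):
        d = np.linalg.norm(dic - y, axis=1)   # d_i = ||y - Dic_i||_2
        nn = np.argsort(d)[:k]                # k-nearest visual words N_k(y)
        dmax, dmin = d[nn].max(), d[nn].min()
        codes[t, nn] = (dmax - d[nn]) / (dmax - dmin + 1e-12)
    return codes.mean(axis=0)                 # deep global feature of one layer
```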

Features from the Fully Connected Layer
In the phase of FC feature extraction, the pre-trained CNN model is employed as a feature extractor. Before feeding the HSRRS images to the model, each image is adjusted to a fixed size. In the CTFCNN framework, the response of the first FC layer is extracted and regarded as the feature representation of a scene. Data augmentation is a common technique in deep learning: each input image is transformed to expand the number of samples by rotating it (90°, 180°, and 270°) and flipping it (horizontally and vertically). Then, mean pooling is employed to process the resulting six FC features to get the FC-aug feature; the corresponding process is shown in Figure 3.
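A minimal sketch of this FC-aug step follows; `extract_fc1` is a hypothetical callable standing in for the first-FC-layer extraction sketched earlier:

```python
# Hedged sketch of FC-aug: mean pooling of first-FC-layer responses over the
# original image and its five transformed variants. `extract_fc1` is a
# hypothetical callable (e.g., the extractor sketched in Section 2.1).
import numpy as np

def fc_aug_feature(img, extract_fc1):
    """img: H x W x 3 array; extract_fc1 maps an image to a 4096-D vector."""
    variants = [
        img,
        np.rot90(img, 1), np.rot90(img, 2), np.rot90(img, 3),  # 90, 180, 270 degrees
        img[:, ::-1],                                          # horizontal flip
        img[::-1, :],                                          # vertical flip
    ]
    feats = np.stack([extract_fc1(v) for v in variants])       # six FC features
    return feats.mean(axis=0)                                  # mean pooling
```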

CNN-Based LBP Features
In HSRRS scene classification, LBP descriptors are usually integrated with feature coding models to achieve the representation of each image. However, these traditional LBP-based methods have limited discriminating ability on highly complex HSRRS images. Owing to the powerful discrimination ability of deep neural networks, the LBP descriptor is integrated with the CNN model to make full use of the texture features of images and provide information complementary to the standard RGB deep model. Because LBP maps are not suitable for direct input to the pre-trained CNN model, a new pre-processing solution is proposed to extract LBP-based FC (LBPFC) features.
Each channel of an original image is regarded as a gray image and converted into a texture image through the LBP descriptor described in Section 2.3. Then, these texture images are stacked to synthesize one three-channel image. Furthermore, the newly obtained texture image is adjusted to the fixed input size. As for data augmentation, only rotation (90°, 180°, and 270°) is performed. After the four LBPFC features are obtained, mean pooling is used to achieve the global representation of the image as well.
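A hedged sketch of this LBPFC pre-processing follows, reusing `lbp_3x3` from the earlier LBP sketch and the hypothetical `extract_fc1` callable:

```python
# Hedged LBPFC sketch: per-channel LBP texture maps stacked into one
# three-channel image, rotation-only augmentation, then mean pooling.
import numpy as np

def lbpfc_feature(img, extract_fc1):
    """img: H x W x 3 uint8; returns the mean-pooled LBP-based FC feature."""
    texture = np.stack([lbp_3x3(img[:, :, c]) for c in range(3)], axis=-1)
    variants = [texture] + [np.rot90(texture, r) for r in (1, 2, 3)]  # rotation only
    feats = np.stack([extract_fc1(v) for v in variants])  # four LBPFC features
    return feats.mean(axis=0)                             # global representation
```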

Feature Fusion and Classification
After extracting the different types of features from the CNN, it is crucial to fuse them effectively for classification. Due to the high dimensionality of the features, principal component analysis (PCA) is first employed to avoid the curse of dimensionality. After normalization, a weighted concatenation is applied to the features as follows:
$$ F = \left[\, F_{\mathrm{conv}}, \; q_1 \cdot F_{\mathrm{FC\text{-}aug}}, \; q_2 \cdot F_{\mathrm{LBPFC}} \,\right], $$
in which $q_1$ and $q_2$ denote the coefficient weights of the different features, respectively. Finally, a linear support vector machine (SVM) classifier is employed to predict the labels of the samples.
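A hedged sketch of this fusion and classification step follows; the PCA dimension and the default weights are assumptions:

```python
# Hedged fusion sketch: PCA, L2 normalization, weighted concatenation, and a
# linear SVM. The PCA dimension `dim` and weights q1, q2 are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def fuse_and_train(F_conv, F_fcaug, F_lbpfc, labels, q1=1.0, q2=1.0, dim=512):
    """Each F_* is (num_samples, d_*); returns the classifier and fused features."""
    parts = []
    for F in (F_conv, F_fcaug, F_lbpfc):
        F = PCA(n_components=min(dim, min(F.shape))).fit_transform(F)
        parts.append(normalize(F))                # L2 normalization per sample
    fused = np.hstack([parts[0], q1 * parts[1], q2 * parts[2]])
    clf = LinearSVC(C=1.0).fit(fused, labels)     # C tuned by grid search in the paper
    return clf, fused
```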

Experiments and Discussion
In this section, two public datasets are employed to evaluate our proposed methods, and the performance of scene classification is compared with some state-of-the-art algorithms.

Dataset Description
(1) UC Merced Land Use Dataset (UC-Merced dataset) [25]: The original images of this dataset were collected from the National Map of the US Geological Survey. The dataset contains a total of 2100 images with a size of 256 × 256 pixels and a spatial resolution of about 0.3 m. It is divided into 21 land-use scene classes, such as agricultural, airplane, baseball diamond, and buildings. The UC-Merced dataset is difficult to classify because it contains a large number of similar land-use types, such as dense residential, medium residential, and sparse residential. As a public dataset, it is widely adopted to evaluate scene classification methods in the remote sensing field.
(2) Aerial Image Dataset (AID) [22]: This dataset was collected by Wuhan University from Google Earth imagery. Each scene image has 600 × 600 pixels, and the spatial resolution ranges from 0.5 m to 8 m. The dataset contains a total of 10,000 images, which are divided into 30 semantic categories, as in Figure 5b. The number of scene images per class ranges from 220 to 420, including airport, bare land, baseball field, and beach; the details are shown in Table 1. Since the AID images were taken by different sensors, in different countries, at different times, and in different seasons, the intraclass diversity of the AID is large. In addition, this dataset has small interclass dissimilarity as well. Details about the AID are available at http://captain.whu.edu.cn/project/AID/.

Experimental Setup
In each experiment, the dataset was randomly divided into training and test sets. In the UC-Merced dataset, 80% of the samples were used for training, and the rest were employed for testing. In the AID, the ratio of training samples was set to 50%. After feature extraction, the public LIBLINEAR library [66] was employed to train the linear SVM classifier, and the penalty term $C$ was tuned by a grid search over the set $\{2^{-10}, 2^{-9}, \ldots, 2^{9}, 2^{10}\}$. Overall accuracy (OA) with standard deviation (STD) and the confusion matrix were adopted to evaluate scene classification performance, and experiments were repeated 20 times in each condition.
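A hedged scikit-learn sketch of this protocol follows (its `LinearSVC` also wraps LIBLINEAR); the 3-fold inner cross-validation used to select $C$ is an assumption, since the paper does not state how $C$ was chosen:

```python
# Hedged sketch of the evaluation protocol: 20 random splits, grid search
# over C in {2^-10, ..., 2^10}, and OA with standard deviation.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

def evaluate(features, labels, train_ratio=0.8, repeats=20):
    grid = {'C': [2.0 ** p for p in range(-10, 11)]}
    accs = []
    for seed in range(repeats):
        Xtr, Xte, ytr, yte = train_test_split(
            features, labels, train_size=train_ratio,
            stratify=labels, random_state=seed)
        clf = GridSearchCV(LinearSVC(), grid, cv=3).fit(Xtr, ytr)  # tune C (assumed CV)
        accs.append(clf.score(Xte, yte))
    return float(np.mean(accs)), float(np.std(accs))  # OA with STD
```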
In the phase of convolutional feature extraction, each image of the two datasets was resized to 300 × 300 pixels to get an appropriate size of feature maps. Then, features from each convolutional layer could be obtained by CaffeNet with sizes of 73 × 73 × 96, 36 × 36 × 256, 18 × 18 × 384, 18 × 18 × 384, and 18 × 18 × 256. As for the FC features and LBP-based FC features, the images were adjusted to 227 × 227 pixels before being fed into CaffeNet, and a 4096-dimensional feature was extracted from the first FC layer.
All experiments were performed on a personal computer equipped with 16 GB of memory, an i5-8500 CPU, and 64-bit Windows 10, using MATLAB 2016a. The MATLAB toolbox MatConvNet [67] was employed to extract the different CNN features. The off-the-shelf CaffeNet [59] trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset can be downloaded from http://www.vlfeat.org/matconvnet/pretrained/.

Parameter Evaluation
The CTFCNN method contains several hyper-parameters, including the size $N_V$ of the visual vocabulary and the embedding dimension $d$ of the deep local features. In this section, we describe experiments conducted to evaluate the influence of these hyper-parameters on classification performance.
To explore the classification performance with different sizes of visual vocabulary in iBoVW coding, each convolutional layer was examined separately. Parameter $N_V$ was tuned over the set $\{10, 100, \ldots, 3800\}$ on both the UC-Merced and AID datasets. Figure 6 shows the average OAs with respect to $N_V$. As can be seen from Figure 6, the OAs first improve with the increase of $N_V$ and then remain stable. The reason is that a larger dictionary contains more abundant information, which helps extract discriminative features for classification. However, if the visual vocabulary is too large, the dimension of the features becomes high, which greatly increases the computational complexity. Based on the above analysis, $N_V$ was set to 2400 for the UC-Merced dataset and 3000 for the AID in the following experiments.
To evaluate the influence of the embedding dimension $d$, the OAs of each convolutional layer were examined with varying $d$; the results are shown in Figure 7. According to Figure 7, with the increase in $d$, the OAs first increase and then remain stable, because overly low-dimensional features lose much useful information. The vertical lines in the figure represent the original dimensions of the local deep features on each convolutional layer. It is obvious that the classification accuracy can be improved after employing LPP for dimensionality reduction, because LPP maintains the local geometric structure. To achieve better classification results, parameter $d$ was set to 95, 225, 350, 350, and 235 for the respective convolutional layers.

Comparison and Analysis of Proposed Methods
To illustrate the effectiveness of the iBoVW encoding method, the OAs obtained from each convolutional layer were compared on the UC-Merced and AID datasets. The value of parameter $k$ was set to 5 for the UC-Merced dataset and 10 for the AID. Figure 8 shows the OAs of different feature encoding methods. As shown in Figure 8, the proposed iBoVW method achieves a higher OA than the BoVW method on each convolutional layer. The reason is that the traditional BoVW model implements feature coding using the VQ method, a hard-assignment method that assumes each feature vector is related to only one visual word in the dictionary. However, in real applications, feature vectors are often correlated with multiple clustering centers. Compared with the BoVW method, the proposed iBoVW method adopts non-linear coding to fully consider the correlation between input vectors and multiple clustering centers.
Figure 9 reports the overall accuracy of different CaffeNet-based features. The classification performance of the deeper convolutional layers is much better than that of the lower convolutional layers. Compared with features from a single convolutional layer, multilayer feature fusion achieves better results because different convolutional layers contain diverse information, and the discriminating ability can be improved by fusion. For the FC features, data augmentation helps improve classification accuracy: on the UC-Merced dataset, the OA improved from 95.95% to 97.14%, and on the AID it reached 92.02%. Furthermore, it is obvious that our LBP-based CNN method is superior to mid-level methods such as LLC, BoVW, VLAD, and IFK. Compared with the method proposed by Levi et al. [68], the proposed pre-processing method is more suitable for LBPFC feature extraction. In summary, the CTFCNN method achieves the highest classification accuracy, which indicates that the strategy of triple-part feature fusion is more effective for scene classification.
Figure 10 shows the confusion matrices of four methods on the UC-Merced dataset, including single-feature-based methods and the CTFCNN method. It is clear that the CTFCNN method obtains the highest classification performance, and the accuracy of most categories is close to 100%. This shows that the CTFCNN method fully explores the discriminating ability of the pre-trained CNN. Furthermore, our proposed method achieves relatively good classification accuracy on some highly similar scene categories, such as dense residential (100%), medium residential (95%), and sparse residential (100%). These semantic classes contain the same types of ground objects (e.g., buildings, roads, and vegetation) and are visually similar, so they are prone to misclassification. The CTFCNN method still obtains a competitive classification performance, which proves the effectiveness of the feature fusion framework.
From Table 2, we see that the CNN-based methods achieve more satisfactory results than the mid-level methods (i.e., VLAD, VLAT, and MS-CLBP+FV). In [22], three kinds of pre-trained CNN models were employed as feature extractors, and the features from the first fully connected layer were used for classification. In [44,46,51–53,71], deep features from pre-trained CNNs were reprocessed to achieve excellent performance. In [73], a deep-learning-based classification method was presented to improve classification performance by combining pre-trained CNNs and an extreme learning machine (ELM). In [56,72,74], deep features and hand-crafted features were combined to obtain a discriminative scene representation. In [54,55], multilayer features from a convolutional neural network were fused to get better results. In addition, the work reported in [75,76] adopted a multiscale feature fusion strategy. In contrast, our proposed method provides an improvement over recent CNN-based methods and focuses on combining triple-part features of a CNN model for classification. On the UC-Merced dataset, the highest classification accuracy of the CTFCNN was 98.44%.
Compared with the UC-Merced dataset, the AID dataset became available later (in 2017) and is larger in scale. Nevertheless, the CTFCNN method achieved excellent classification results (94.91%) on the AID. As can be seen from Table 3, the performance of our proposed method is better than that of some state-of-the-art methods based on pre-trained CNNs. In [78,79], a two-stage deep feature fusion method and multilevel fusion methods achieved satisfactory results; the reason is that they adopt multiple types of CNN models, including CaffeNet and VGG-VD-16, while our method employs only one type of CNN model. The classification result in [43] is good because a CNN and CapsNet are integrated: feature maps from a pre-trained VGG-16 are fed into the CapsNet and participate in fine-tuning. The above analysis indicates that the proposed CTFCNN method can effectively extract the discriminating features of scenes to achieve better classification.
Table 3. Overall accuracy (mean ± SD) comparison of recent methods under a training ratio of 50% on the AID dataset.

Conclusions
In this paper, we proposed the CTFCNN framework to fully exploit the discriminating ability of a pre-trained CaffeNet. In this framework, CaffeNet is employed as a feature extractor to obtain multilayer convolutional features, features from the fully connected layer, and LBP-based FC features. Then, the iBoVW method, which employs LPP and a nonlinear coding method, is developed to process the convolutional features. For the LBP-based FC features, a new solution is proposed to integrate texture images with the pre-trained CNN model. Finally, the three features are combined by weighted concatenation. As a result, the proposed framework can effectively achieve representations of HSRRS images and improve scene classification performance. Experimental results on two public datasets (UC-Merced and AID) show that the CTFCNN method obtains much better results than some state-of-the-art methods in terms of overall accuracy, with highest OAs of 98.44% and 94.91%, respectively. In the future, we will incorporate fine-tuning techniques and focus on combining features from regions of interest (ROIs) to further improve classification performance.