A Fast Deep Perception Network for Remote Sensing Scene Classification

Abstract: Current scene classification for high-resolution remote sensing images usually uses deep convolutional neural networks (DCNN) to extract extensive features and adopts a support vector machine (SVM) as the classifier. DCNNs exploit deep features well but ignore valuable shallow features such as texture and directional information, and SVMs can hardly be trained efficiently on a large number of samples. This paper proposes a fast deep perception network (FDPResNet) that integrates DCNN and the Broad Learning System (BLS), a novel and effective learning system, to extract both deep and shallow features, and that encapsulates a purpose-built deep perception module (DPModel) to fuse the two kinds of features. FDPResNet first extracts the shallow and the deep scene features of a remote sensing image through a pre-trained 101-layer residual neural network (ResNet101). Then, it inputs the two kinds of features into the DPModel to obtain a new set of feature vectors that describe both the higher-level semantic information and the lower-level spatial information of the image. The DPModel is the key module responsible for dimension reduction and feature fusion. Finally, the new feature vectors are input into BLS for training and classification. A series of experiments conducted on the challenging NWPU-RESISC45 remote sensing image dataset demonstrates that our approach outperforms some popular state-of-the-art deep learning methods and achieves highly accurate scene classification in a shorter running time.


Introduction
Ever-advancing remote sensing technology can now generate a large number of high-resolution remote sensing images quickly and effectively. This capability has promoted its application in numerous fields, including natural disaster monitoring, geospatial object detection, traffic supervision, weapon guidance and urban planning [1][2][3]. High-resolution remote sensing images, however, have unique characteristics that make classifying the scenes in them quite difficult. They vary in size and contain diversified contents, such as multi-directional targets standing against complex backgrounds. Some scenes in different classes may exhibit similar geographical features, while some belonging to the same class may look quite different, which easily leads scene classification

Motivated by these advantages, we propose a fast deep perception network based on ResNet101 (abbreviated as FDPResNet) that specifically targets scene classification for remote sensing images. First, a ResNet101 [9] model pre-trained on ImageNet is used to extract the shallow and deep features of remote sensing images. Second, the shallow and the deep features are input into the deep perception module and converted to a set of depth-dense vectors. Finally, the vectors are input into the BLS-based pattern recognition system for training and classification. The contributions of our study can be summarized as follows:
1. Our study integrates DCNN and BLS into one framework to exploit both DCNN's effective feature learning and BLS's fast decision-making. The proposed framework can obtain the semantic information in high-resolution remote sensing images, as well as achieve fast pattern recognition.
2. We propose a deep perception model (DPModel) that can utilize both the shallow and the deep features of an image and extract richer semantic information from it.
The model uses a near-scale averaging operation to average the obtained shallow features, that is, it integrates close convolutional layers into new convolutional layers that are then transformed into feature vectors through a vectorization operation. DPModel also adopts principal component analysis (PCA) [25] to avoid the curse of dimensionality that arises with high-dimensional vectors after feature aggregation. Finally, the model cascades the deep features and the dimension-reduced shallow features from top to bottom and produces new feature vectors that represent richer semantic information of the image.
The rest of the paper is organized as follows: Section 2 describes the principles and workflows of the proposed method; Section 3 presents the experiments and discussions. Conclusions are drawn in Section 4.

Figure 1 explains the framework of the proposed FDPResNet. It consists of three steps. First, the high-resolution remote sensing images are input into a ResNet101 model pre-trained on ImageNet, with no retraining or fine-tuning of the network, and the shallow and the deep features are obtained. Second, the shallow features are integrated by close-scale averaging operations, and the nearby convolutional layers are merged into new convolutional layers. The features that the new layers contain are then flattened into vectors. PCA is adopted to reduce the dimensions of these new vectors, and the reduced shallow features and the deep features are then cascaded to form new depth-dense feature vectors. Third, the depth-dense feature vectors are used to train the BLS network for classification.

Feature Extraction
Residual neural networks (ResNets), proposed by He et al. [9], are deep neural networks that build on highway networks. ResNet replaces the gating unit of the highway network with a shortcut connection to reduce network parameters while preserving the original information. The advent of ResNet is a milestone in the advancement of deep learning. ResNet can accelerate the training of ultra-deep neural networks while greatly improving accuracy. It also circumvents the problem that increasing the number of network layers incurs vanishing or exploding gradients, which makes training extremely deep networks possible. Recently, some researchers have proved that a ResNet with only one neuron per hidden layer is a universal function approximator. The identity mapping enhances the expressive power of deep networks, and also indicates that ResNet can reduce the redundancy of information in data [26]. Therefore, we choose ResNet101 as the backbone network of our approach.

Current scene classification for remote sensing images usually extracts features either from the convolutional layers of a pre-trained CNN model or from its final (fully connected) layers. The former contain local information and rich spatial data, while the latter contain semantic category information but lack sufficient spatial information. Exploiting the two kinds of features, which complement each other, can provide a powerful representation. Therefore, FDPResNet extracts information from both the convolutional layer and the fully connected layer in this step.
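The shortcut connection described above can be illustrated with a toy example. The following is a minimal NumPy sketch, not the ResNet101 architecture itself: it uses a small fully connected residual branch instead of convolutions and batch normalization, and the names `residual_block`, `w1` and `w2` are hypothetical. It shows only that the block output is the residual branch plus the unchanged input.

```python
import numpy as np


def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Toy fully connected residual block: y = relu(F(x) + x).

    F(x) = relu(x @ w1) @ w2 is the residual branch; the "+ x" term is the
    identity shortcut. Real ResNet blocks use convolutions and batch
    normalization, which are omitted in this sketch.
    """
    residual = relu(x @ w1) @ w2
    return relu(residual + x)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # 4 samples, 8 features (toy sizes)
w_zero = np.zeros((8, 8))

# With an all-zero residual branch, the block reduces to relu(x): the input
# passes through the shortcut unchanged.
y = residual_block(x, w_zero, w_zero)
```

Because the shortcut carries the input forward unchanged, a block whose residual branch contributes nothing still preserves the original information, which is the property the paragraph above attributes to ResNet.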
This paper adopts a ResNet101 model pre-trained on ImageNet for feature extraction. The extracted features are divided into two categories: (1) the shallow features, which have passed through the first convolution and the max pooling of the pre-trained ResNet101 model; (2) the deep features, which have passed through the fifth convolution and the average pooling of the pre-trained ResNet101 model (close to the final classification layer).
The detailed process of extracting features is as follows. First, an image $I$ that has been cropped to $n \times n$ to fit the network is input into the network. Second, the image passes through the network in the forward direction. Suppose there is a convolutional layer $L_l$ located at the $l$th layer. After the image passes through layer $L_l$, an $m \times m \times d$ feature map $M_l$ is obtained. For convenience, we denote $map_l = m \times m$. Thus, for a feature map $M_l$, the feature of each channel can be denoted as $map_l^i$, where $1 \le i \le d$ and $d$ represents the overall dimension of the shallow feature, and the feature map can be represented as $M_l = \{map_l^1, map_l^2, \ldots, map_l^d\} \in R^{m \times m \times d}$. Therefore, the extracted shallow features can be denoted as

$$M_1 = \{map_1^1, map_1^2, \ldots, map_1^d\}.$$

The deep features can be expressed as

$$M_5 = \{map_5^1, map_5^2, \ldots, map_5^{d_{last}}\}.$$

Third, $M_1$ and $M_5$ are input into the deep perception module to obtain the final features that represent the image.

Processing Shallow Features
The feature $M_1$ obtained as described in Section 2.1 is a multi-dimensional feature. Figure 3 displays the visualization of $M_1$. It shows that not every channel effectively represents the spatial information of the image. We therefore propose a near-scale averaging strategy that selects a few shallow features from this $d$-dimensional feature map, so that the selected features represent texture information and spatial direction information.
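As mentioned in Section 2.1, an $m \times m$ feature map can be denoted as $map_l^i$. Suppose $F^i_{x,y}$ is a single feature value in $map_l^i$, where $1 \le x \le m$, $1 \le y \le m$, and $F^i_{ave}$ represents the mean of the feature values of $map_l^i$:

$$F^i_{ave} = \frac{1}{m^2} \sum_{x=1}^{m} \sum_{y=1}^{m} F^i_{x,y}.$$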
For $M_1$, the set of mean values of each channel can be expressed as $F_1 = \{F^1_{ave}, F^2_{ave}, \ldots, F^d_{ave}\}$. Then the sum of the values in the set $F_1$, denoted as $MAP$, can be expressed as

$$MAP = \sum_{i=1}^{d} F^i_{ave}.$$

We need to find the channels in $M_1$ whose mean values are closest to the average. The channels that are the closest to the average are obtained as $C = \{c_1, c_2, \ldots, c_\eta\}$, and the feature set of these channels is denoted as $MAP_{aim}$. The results of several experiments we have conducted demonstrate that scene recognition performs best when $\eta = 3$. $MAP_{aim}$ is then flattened into a vector of size $1 \times Dim$, denoted as $F_{shallow}$.
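The channel selection described above can be sketched in NumPy. The exact selection formula is not given in the text, so this sketch assumes one plausible reading: the distance of each channel mean to the average of all channel means (the sum MAP divided by d) is measured by absolute difference, and the η closest channels are kept. The function name `select_near_average_channels` and the toy feature map are hypothetical.

```python
import numpy as np


def select_near_average_channels(feature_map, eta=3):
    """Pick the eta channels whose mean is closest to the average channel mean.

    feature_map: array of shape (m, m, d), e.g. the shallow feature M_1.
    Returns (chosen channel indices, selected maps (m, m, eta), flattened
    vector of shape (1, m * m * eta)).
    ASSUMPTION: closeness is the absolute difference |F_ave^i - MAP / d|.
    """
    channel_means = feature_map.mean(axis=(0, 1))        # F_ave^i for each channel
    map_sum = channel_means.sum()                        # MAP, the sum over channels
    average = map_sum / channel_means.size               # average of the channel means
    distance = np.abs(channel_means - average)
    chosen = np.argsort(distance, kind="stable")[:eta]   # eta closest channels
    selected = feature_map[:, :, chosen]                 # MAP_aim
    f_shallow = selected.reshape(1, -1)                  # F_shallow, 1 x Dim
    return chosen, selected, f_shallow


# Toy feature map: m = 2, d = 4, each channel constant so its mean is obvious.
toy_map = np.stack([np.full((2, 2), v) for v in (0.0, 4.0, 5.0, 11.0)], axis=-1)
chosen, selected, f_shallow = select_near_average_channels(toy_map, eta=3)
```

In the toy example the channel means are 0, 4, 5 and 11, their average is 5, and the three closest channels are kept, giving a flattened vector of length 2 × 2 × 3 = 12.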

PCA Reducing Dimensions
After the previous step, the number of channels in $F_{shallow}$ has been reduced to $\eta$, but the vector still contains redundant features. For example, suppose $F_{shallow}$ has size $1 \times (m \times m \times \eta)$. When $m = 54$ and $\eta = 3$, the dimension of the shallow feature is $1 \times 8748$, which easily causes the curse of dimensionality. Therefore, we adopt principal component analysis (PCA) to reduce the dimension of $F_{shallow}$.
The core of PCA is to project each sample onto the directions of largest variance in the sample space, that is, to find the components with large variance in order to retain as much of the original information as possible after dimension reduction. In other words, PCA avoids the curse of dimensionality by discarding part of the information during dimension reduction. The features after PCA dimension reduction are also uncorrelated with each other [27].
In this paper, the reduced dimension is denoted as $D^{PCA}_{shallow}$. Given the key role that an appropriate dimension plays in classification, we conducted a grid search over [256, 2048] with an interval of 128 to find a suitable dimension, and found that the shallow features achieve the best scene classification when $D^{PCA}_{shallow} = 512$. The feature after dimension reduction is denoted as $F^{PCA}_{shallow}$.
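The dimension reduction step can be sketched with a plain NumPy PCA based on the singular value decomposition. This is an illustrative sketch rather than the implementation used in the experiments: the function name `pca_reduce` is hypothetical, and toy sizes (200 samples, 48 dimensions reduced to 8) stand in for the 8748-dimensional shallow vectors reduced to 512.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project row-vector samples onto their top principal components.

    features: array of shape (N, Dim), one flattened shallow feature per row.
    Returns (reduced features (N, n_components), components, mean).
    """
    mean = features.mean(axis=0, keepdims=True)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    reduced = centered @ components.T
    return reduced, components, mean


# Dimension of the flattened shallow feature quoted in the text.
shallow_dim = 54 * 54 * 3

# Toy data: 200 samples of dimension 48, reduced to 8 components.
rng = np.random.default_rng(0)
toy_features = rng.standard_normal((200, 48)) * np.linspace(3.0, 0.1, 48)
reduced, components, mean = pca_reduce(toy_features, n_components=8)
covariance = np.cov(reduced, rowvar=False)
```

The covariance matrix of the reduced features is diagonal up to rounding error, which reflects the statement that the features are uncorrelated after PCA. Keeping 512 components, as in the paper, requires more than 512 training samples, because the number of non-trivial components cannot exceed the number of centered samples.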

Aggregation of Features
The deep features obtained through feature extraction, as described in Section 2.1, represent the high-level semantic information, and the shallow features describe the spatial and texture information. Scene classification mostly depends on the semantic information and uses the spatial and texture information as a supplement. Therefore, our approach adopts a top-down aggregation to fuse the two kinds of features. First, the deep feature $M_5$ obtained through feature extraction is flattened into a vector $F_{deep}$ with dimension $1 \times d_{last}$. Then, the two kinds of features are aggregated by cascading the deep features and the dimension-reduced shallow features into one vector, $F_{aggregate} = [F_{deep}, F^{PCA}_{shallow}]$. PCA is also utilized here to find the optimal dimension, and the results indicate that the best classification results are achieved when the dimension $d_{last}$ of $F_{deep}$ is 2048.
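As a small illustration of the top-down aggregation, the sketch below assumes that cascading means concatenating the two vectors along the feature axis, with the deep features first. The function name `aggregate_features` is hypothetical, and the sizes are toy values chosen for readability, not the dimensions used in the paper.

```python
import numpy as np


def aggregate_features(f_deep, f_shallow_pca):
    """Cascade deep and reduced shallow features, deep features first.

    f_deep: (N, d_last); f_shallow_pca: (N, D_pca). Returns (N, d_last + D_pca).
    ASSUMPTION: "cascading" is plain concatenation along the feature axis.
    """
    return np.concatenate([f_deep, f_shallow_pca], axis=1)


# Toy sizes: 5 samples, 6 deep dimensions, 4 reduced shallow dimensions.
f_deep = np.ones((5, 6))
f_shallow_pca = np.zeros((5, 4))
f_aggregate = aggregate_features(f_deep, f_shallow_pca)
```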

Broad Learning System
The broad learning system (BLS) is a variant of the random vector functional link neural network (RVFLNN) [28]. A BLS network [22], illustrated by the purple box in Figure 1, works in three steps: first, features mapped from the input data are used as the "feature nodes" of the network; second, the mapped features are expanded into "enhancement nodes" with randomly generated weights; third, all mapped features and enhancement nodes are directly connected to the output, and the corresponding output coefficients are computed quickly through a pseudo-inverse.
According to Section 2.2.3, the depth-dense vector is denoted $F_{aggregate}$, with a dimension $D$ of 2048. Suppose the number of samples in the data is $N$; the input samples can then be defined as $F \in R^{N \times D}$, with $F = [F^1_{aggregate}, F^2_{aggregate}, \ldots, F^N_{aggregate}]$. Let the mapped features in BLS be $Z$ and the number of feature nodes be $b$; then the mapped features of the dataset on the feature plane are

$$Z^{N \times b} = F W_e,$$

where $W_e$ is the optimal input weight matrix obtained by sparse self-encoding. If $k$ enhancement nodes are generated in BLS, and $H$ is used to represent the enhanced feature matrix, the enhanced feature matrix of the dataset can be expressed as

$$H^{N \times k} = \varphi(Z W_h + \beta_h),$$

where $W_h$ and $\beta_h$ represent the random weight matrix and bias, respectively; $\varphi(\cdot)$ is an optional non-linear activation function, and tansig is selected as the activation function in this paper. The feature nodes and the enhancement nodes are concatenated into a merged matrix, which is the actual input of the BLS output layer, that is, $A = [Z^{N \times b} \mid H^{N \times k}]$. We assume that the output matrix is $L \in R^{N \times C}$, where $C$ represents the number of categories in the dataset; then the output matrix can be obtained according to BLS as

$$L = A W,$$

where $W$ represents the connection weight matrix. $W$ is obtained from the ridge regression approximation of the pseudo-inverse $A^{+}$, as shown in Equation (9):

$$W = (\lambda I + A^{T} A)^{-1} A^{T} L. \quad (9)$$

Therefore, in a BLS system, the network only needs to learn the output weight matrix $W$. In this formula, $\lambda$ is the $l_2$-norm regularization coefficient, set to $\lambda = 2^{-10}$.
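The BLS training procedure described above can be sketched in a few lines of NumPy. This is a simplified sketch under stated assumptions, not the implementation used in the experiments: the input weights `W_e` are drawn at random instead of being tuned by sparse self-encoding, `np.tanh` stands in for tansig (the two are mathematically equivalent), and the function names, node counts and toy data are hypothetical.

```python
import numpy as np


def train_bls(F, L, b=20, k=40, lam=2.0 ** -10, seed=0):
    """Fit a minimal BLS: feature nodes Z, enhancement nodes H, ridge output W.

    F: (N, D) input features; L: (N, C) one-hot labels.
    ASSUMPTION: W_e is random here; the paper tunes it by sparse self-encoding.
    """
    rng = np.random.default_rng(seed)
    D = F.shape[1]
    W_e = rng.standard_normal((D, b)) / np.sqrt(D)   # input weights (random stand-in)
    W_h = rng.standard_normal((b, k)) / np.sqrt(b)   # random enhancement weights
    beta_h = rng.standard_normal((1, k))             # random enhancement bias
    Z = F @ W_e                                      # feature nodes, N x b
    H = np.tanh(Z @ W_h + beta_h)                    # enhancement nodes, N x k
    A = np.hstack([Z, H])                            # merged matrix [Z | H]
    # Ridge regression solution W = (lam * I + A^T A)^{-1} A^T L.
    W = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ L)
    return W_e, W_h, beta_h, W


def predict_bls(F, params):
    """Return class scores A @ W for new inputs F."""
    W_e, W_h, beta_h, W = params
    Z = F @ W_e
    H = np.tanh(Z @ W_h + beta_h)
    return np.hstack([Z, H]) @ W


# Toy data: three well-separated Gaussian clusters in 10 dimensions.
rng = np.random.default_rng(1)
n_classes, dim = 3, 10
labels = np.repeat(np.arange(n_classes), 50)
centers = 5.0 * rng.standard_normal((n_classes, dim))
F = centers[labels] + rng.standard_normal((labels.size, dim))
L = np.eye(n_classes)[labels]                        # one-hot label matrix

params = train_bls(F, L)
predicted = predict_bls(F, params).argmax(axis=1)
train_accuracy = (predicted == labels).mean()
```

Only the output weights `W` are learned, and they come from a single linear solve rather than iterative gradient descent, which is the source of the fast training the section describes.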

Dataset
We conducted a series of experiments on the NWPU-RESISC45 dataset [21], which was created by a research team at Northwestern Polytechnical University in 2017. It contains 31,500 remote sensing images in 45 scene categories. Each scene category contains 700 images of size 256 × 256. The spatial resolution of most images ranges from about 30 m to 0.2 m per pixel, and images in certain categories of special landforms, such as islands, lakes, regular mountains and snow mountains, may have lower resolution. The dataset includes abundant scene categories, and each category exhibits considerable within-class diversity and similarity with other classes. It is a challenging benchmark for testing scene classification of remote sensing images.

Implementation Details
We randomly selected training samples and test samples from each category at ratios of 2:8 and 1:9. Each set of experiments was repeated 10 times. The final classification performance was evaluated by the average of the accuracies of all experiments. We adopted the LibLinear library [16] to perform linear SVM training and testing.
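The sampling protocol above can be sketched as a per-category random split repeated with different seeds. This is an illustrative NumPy sketch; the function name `stratified_split` and the use of integer class labels in place of image files are assumptions made for the example.

```python
import numpy as np


def stratified_split(labels, train_ratio, rng):
    """Randomly split sample indices per category at the given training ratio.

    labels: (N,) integer class labels. Returns (train indices, test indices).
    """
    labels = np.asarray(labels)
    train_parts = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)                              # random order within the category
        n_train = int(round(train_ratio * idx.size))
        train_parts.append(idx[:n_train])
    train_idx = np.sort(np.concatenate(train_parts))
    test_mask = np.ones(labels.size, dtype=bool)
    test_mask[train_idx] = False                      # everything else is test data
    return train_idx, np.flatnonzero(test_mask)


# NWPU-RESISC45 layout: 45 categories with 700 images each.
labels = np.repeat(np.arange(45), 700)

# Ten repetitions of the 2:8 split, each with its own seed.
splits_20 = [stratified_split(labels, 0.2, np.random.default_rng(s)) for s in range(10)]
train_20, test_20 = splits_20[0]

# One 1:9 split for comparison.
train_10, test_10 = stratified_split(labels, 0.1, np.random.default_rng(0))
```

With a 2:8 ratio each category contributes 140 training and 560 test images; with 1:9 it contributes 70 and 630. The reported accuracy would then be the mean over the ten repetitions.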
All the experiments were conducted on a personal computer with a 2.4 GHz quad-core CPU and a GeForce GTX 1080 GPU (8 GB), running MathWorks MATLAB R2018b. Multiple pre-tests helped us determine the values of the key variables: $\eta = 3$, $D^{PCA}_{shallow} = 512$, $\lambda = 2^{-10}$.

Effectiveness of Fusion of Shallow and Deep Features
We used ResNet101 to extract shallow features and deep features from the images in NWPU-RESISC45, and used our approach to fuse the two levels of features. The three kinds of features were then visualized with t-SNE, as shown in Figure 4. Each dot in Figure 4 represents a sample of a category, and different categories are distinguished by different colors. The shallow features in Figure 4a appear as disordered lines of dots with linear structure rather than clusters, while the deep features in Figure 4b are presented at a high level of abstraction with obvious clustering, but the boundaries between classes are still blurred. The fusion of the two levels of features in Figure 4c shows the final features of those categories. Compared to Figure 4a,b, Figure 4c presents a clearer separation, with greater distances between classes and higher similarity within classes. The final results validate that our approach of feature fusion can effectively improve scene classification performance.

Comparison of the Accuracy
We compared the classification results of the proposed FDPResNet with those of seven state-of-the-art methods that were also evaluated on NWPU-RESISC45 [21], as listed in Table 1. The accuracies in Table 1 indicate that our approach is superior to both the transfer learning and the fine-tuning methods [21]. Compared with the D-CNN with VGGNet-16 [13] method, which performed the best among the seven, the accuracy of our approach is 4% higher. The accuracy comparison demonstrates the better effectiveness of our approach on NWPU-RESISC45.

[21] with SVM as the classifier. Here, the training ratio on the NWPU-RESISC45 dataset is 20%. Table 2 indicates that, compared with the methods using SVM [21], FDPResNet reduces the training time by a factor of approximately three and the test time by a factor of approximately four. We also input the vectors obtained from our proposed DPModel into an SVM-based method for classification, and the resulting accuracy is higher than that of the six methods [21] and lower than that of the FDPResNet framework. This demonstrates that the features obtained by the DPModel better represent the semantic information of the image and generalize well. As for operating efficiency, the training time and the test time of the DRResNet+SVM method are greater than those of FDPResNet, which means that the BLS adopted in the FDPResNet framework is better suited to scene classification of remote sensing images.

Table 2. Comparison of training time and testing time of different methods on the NWPU-RESISC45 dataset.

Confusion Matrix
To better understand the performance of our approach, we computed a confusion matrix to illustrate the correctness of the classification results, as shown in Figure 5. Each row of the matrix represents the class predicted by our approach, while each column represents the actual class. Thus, cells on the diagonal indicate correct predictions, while cells in other areas indicate errors. The numbers in each cell represent the total number and the percentage of the predicted instances, and the classes are arranged in descending order of correctness along the diagonal from left to right.
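A confusion matrix with the layout described above (rows for predicted classes, columns for actual classes) can be computed as in the following minimal NumPy sketch. The function name `confusion_matrix` and the six toy labels are hypothetical and are not taken from the experiments.

```python
import numpy as np


def confusion_matrix(actual, predicted, n_classes):
    """Count predictions with rows = predicted class and columns = actual class."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (predicted, actual), 1)   # accumulate one count per sample
    return cm


# Toy example with three classes and six samples.
actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(actual, predicted, n_classes=3)

# Per-class accuracy: correct predictions divided by the actual class size
# (the column sum, because columns hold the actual classes).
per_class_accuracy = np.diag(cm) / cm.sum(axis=0)
overall_accuracy = np.trace(cm) / cm.sum()

# Order classes by descending per-class accuracy along the diagonal.
order = np.argsort(-per_class_accuracy, kind="stable")
cm_sorted = cm[np.ix_(order, order)]
```

Sorting both rows and columns by per-class accuracy reproduces the arrangement in which correctness decreases along the diagonal from left to right.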
In Figure 5, more than 86% of the classes reach recognition accuracies higher than 90%, and only 4% of the classes fall below 80%. This further demonstrates that FDPResNet is applicable to images with complex contents, like those in NWPU-RESISC45. The scenes that FDPResNet recognizes with high accuracy contain a single texture and exhibit little within-class similarity, such as basketball_court, rectangular_farmland and mountain; those that FDPResNet recognizes with low accuracy contain more complex contents and exhibit high between-class similarity and larger within-class diversity, such as wetland, commercial_area and intersection. While great success has been obtained so far, within-class diversity and between-class similarity remain two major challenges.

Conclusions
This paper proposes a novel fast deep perception network (FDPResNet) that combines the strengths of DCNN and BLS. FDPResNet uses a ResNet101 model pre-trained on ImageNet to extract both shallow features and deep features from an image, and then inputs these two kinds of features into the proposed DPModel to obtain a set of depth-dense vectors that represent the semantic information of the image. BLS then uses this set of depth-dense vectors to output satisfactory scene classification results. The comparison experiments on the challenging NWPU-RESISC45 dataset [21] demonstrate that FDPResNet achieves higher accuracy than the compared methods in a shorter running time. Future work will focus on improving FDPResNet's classification accuracy for scenes that are ambiguous.