Transferring Pre-Trained Deep CNNs for Remote Scene Classification with General Features Learned from Linear PCA Network

Abstract: Deep convolutional neural networks (CNNs) have been widely used to obtain high-level representations in various computer vision tasks. However, in the field of remote sensing, there are not sufficient images to train a useful deep CNN. Instead, we tend to transfer successful pre-trained deep CNNs to remote sensing tasks. In the transferring process, the generalization power of the features in pre-trained deep CNNs plays the key role. In this paper, we propose two promising architectures to extract general features from pre-trained deep CNNs for remote scene classification. These two architectures suggest two directions for improvement. First, before the pre-trained deep CNNs, we design a linear PCA network (LPCANet) to synthesize the spatial information of remote sensing images in each spectral channel. This design shortens the spatial “distance” between the target and source datasets for pre-trained deep CNNs. Second, we introduce quaternion algebra to LPCANet, which further shortens the spectral “distance” between remote sensing images and the images used to pre-train deep CNNs. With five well-known pre-trained deep CNNs, experimental results on three independent remote sensing datasets demonstrate that our proposed framework obtains state-of-the-art results without fine-tuning and feature fusing. This paper also provides a baseline for transferring fresh pre-trained deep CNNs to other remote sensing tasks.


Introduction
Remote sensing image processing has achieved great advances in recent years, from low-level tasks, such as segmentation, to high-level ones, such as classification [1][2][3][4][5][6][7]. However, the task becomes progressively more difficult as the level of abstraction increases, going from pixels, to objects, and then to scenes. Classifying remote sensing images according to a set of semantic categories is a very challenging problem because of high intra-class variability and low inter-class distance [5][6][7][8][9]. Different objects may appear at different scales and orientations in a given class, and the same objects may be found in images belonging to different classes. By constructing a holistic scene representation, the bag-of-visual-words (BOW) model has become one of the most popular approaches for solving the scene classification problem in the remote sensing community [10]. In addition, many variants of the BOW model have been developed to improve the discriminative ability of the "visual words" [11][12][13]. Nevertheless, the representations generated from BOW are still mid-level and not sufficiently powerful for scene classification. Therefore, more representative and higher-level representations are desirable and will certainly play a dominant role in scene-level tasks.
Deep learning algorithms attempt to learn high-level features corresponding to high levels of abstraction. The deep convolutional neural network (CNN) [14], which is acknowledged as the most successful and widely used deep learning model, is now the dominant method in the majority of recognition and detection tasks. Its recent impressive results in computer vision applications have brought dramatic improvements beyond the state-of-the-art records on a number of benchmarks [15][16][17][18]. In the remote sensing field, the use of deep learning is growing rapidly. A considerable number of works propose deep strategies for spatial and spectral feature learning [3][19][20][21]. Vakalopoulou et al. [3] propose an automated building detection framework for very high resolution remote sensing data based on deep convolutional neural networks. In [19], deep convolutional neural networks are employed to classify hyperspectral images directly in the spectral domain. In addition, Makantasis et al. [20] propose a deep learning based method that exploits a CNN to encode pixels' spectral and spatial information and constructs high-level features of hyperspectral data in an automated way. Furthermore, Hamida et al. [21] design a lightweight CNN architecture to process the spectral and spatial information of hyperspectral data, and provide a less costly solution while ensuring an accurate classification of the hyperspectral data. In theory, considering the subtle differences among categories in remote scene classification, we may attempt to form high-level representations for remote sensing images from CNN activations. However, the acquisition of large-scale well-annotated remote sensing image datasets is costly, and in practice it is easy to over-fit when we try to train a high-powered deep CNN with small datasets [22]. On the other hand, even if we obtained sufficiently large remote sensing datasets, learning the billions of parameters in these deep CNNs would be very time-consuming.
ImageNet (http://www.image-net.org/challenges/LSVRC/) is a large-scale dataset, which offers a very comprehensive database of more than 1.2 million categorized natural images of 1000+ classes [23]. Deep CNN models trained on this dataset serve as the backbone for many segmentation, detection and classification tasks on other datasets. Moreover, some very recent works have demonstrated that the representations learned by deep CNNs pre-trained on large datasets such as ImageNet can be transferred to image classification tasks [24]. Some works have also started to apply them to the remote sensing field, and obtain state-of-the-art results on some specific datasets [22,25,26]. Penatti et al. [25] experimentally evaluate the generalization power of CNNs trained for recognizing everyday objects when applied to the classification of remote sensing images. Castelluccio et al. [22] explore the use of pre-trained deep CNNs for the classification of remote scenes. The pre-trained networks are fine-tuned on the target data to avoid overfitting problems and reduce design time. In [26], features from various successful pre-trained deep CNNs are transferred for remote scene classification. By extracting CNN features from different layers, the proposed framework achieves remarkable performance even with a simple linear classifier. However, the generalization power of deep features learned by deep CNNs fades evidently when the remote sensing images differ in space and spectrum from the natural images in the ImageNet dataset [22,25]. Therefore, a natural question is how we can further enhance the generalization power of pre-trained deep CNNs for remote sensing imagery.
PCA network (PCANet) is a simple but effective neural network, which mainly comprises three components: cascaded principal component analysis (PCA), binary hashing, and block-wise histograms [27]. In the PCANet model, there are no nonlinear operations in the early stages, until the very last output layer. Moreover, filter learning in PCANet does not involve regularization parameters or require numerical optimization solvers, and it is unsupervised. In our experiments, we apply a simple and shallow linear PCANet (LPCANet) to the remote sensing images before transferring the pre-trained deep CNNs to them. We find that features learned from this framework improve the remote scene classification performance. To our surprise, this framework works well even when the remote sensing images are very different in space and spectrum from the natural images of the ImageNet dataset that is used to pre-train the deep CNNs. Inspired by this, we evaluate the performance of this framework for remote scene classification in different conditions, and explore the way in which the LPCANet synthesizes the spatial and spectral information of remote sensing images and enhances the generalization power of pre-trained deep CNNs.
Therefore, in our work, we propose a framework to obtain general features from the pre-trained deep CNNs for remote scene classification and attempt to form a baseline for transferring pre-trained deep CNNs to remote sensing images with various spatial and spectral information. By applying a shallow LPCANet to the remote sensing images, we generate features with a particular spatial and spectral form, which serve as inputs of the pre-trained deep CNN. Then, we remove the output layer of the pre-trained deep CNN and treat the remainder of it as a fixed feature extractor. The obtained features of the image scenes are fed into a simple classifier for the scene classification task. We propose two scenarios to test the performance of the LPCANet on extracting general features for pre-trained deep CNNs in space and spectrum, respectively:
(1) By applying a shallow LPCANet to each spectral channel of the remote sensing images, we test the performance of LPCANet on extracting general features for pre-trained CNNs from spatial information.
(2) Furthermore, we introduce quaternion algebra to LPCANet and design the linear quaternion PCANet (LQPCANet) to further extract general features for pre-trained CNNs from spectral information and test its performance for different remote sensing images.
We conduct extensive experiments with different pre-trained deep CNNs such as CaffeNet [17], GoogLeNet [28] and ResNet [29]. Based on various pre-trained deep CNNs, we evaluate our proposed framework on different remote sensing datasets that vary in space and spectrum. The results show that our proposed framework can enhance the generalization power of pre-trained deep CNNs and learn better features for remote scenes. With "unsupervised settings", our proposed framework achieves state-of-the-art performance on some public remote scene datasets.
Our proposed framework hardly contains any deep or new techniques, and our study so far is mainly empirical. However, a thorough report on such a baseline system has tremendous value for transferring pre-trained deep CNNs to remote sensing images that vary in space and spectrum. Our main contributions are summarized as follows:
(1) We thoroughly investigate how the LPCANet and LQPCANet synthesize the spatial and spectral information of remote sensing imagery and how they can enhance the generalization power of pre-trained deep CNNs for remote scene classification.
(2) For future study, our proposed framework can serve as a simple but surprisingly effective baseline for empirically justifying advanced designs for transferring pre-trained deep CNNs to remote sensing images. We can take any pre-trained deep CNN as a starting point and improve the network further with our proposed method.
(3) Our proposed feature learning framework operates under "unsupervised settings", which is an encouraging direction in deep learning and is more promising for remote sensing tasks than supervised or semi-supervised methods.
The rest of the paper is organized as follows. Section 2 provides the whole framework of our proposed method. Section 2.1 presents successful pre-trained deep CNNs nowadays and the development of them. Section 2.2 introduces LPCANet and its quaternion representation, which form the foundation of our proposed two architectures explained in Section 2.3. Experiments are presented in Section 3 and we conclude the paper in Section 4 with some remarks in Section 5.

Pre-Trained Deep Convolutional Neural Networks
According to the biological template discovered by Hubel and Wiesel in 1959, the visual cortex of our brain is organized in layers [30]. The lower layers extract basic features of images, such as spots, lines, and corners. The higher layers combine these basic features to form templates that are more complex. Inspired by this, Fukushima [31] first proposed the convolutional neural networks in 1980, which was then refined by LeCun in 1989 [32]. Thanks to fast growth of affordable computing power, especially graphical processing units (GPUs), and the diffusion of large datasets of labeled images for training, a seminal deep convolutional neural network called AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012) [33], and brought increasing interest for deep CNNs in the last few years. The typical architecture of a deep CNN is composed of multiple cascaded layers with various types. The convolutional layers convolve input feature maps with a set of weights (also called kernels or filters) to generate new feature maps. The deeper convolutional layers are able to learn features that are more abstract by combining lower-level ones learned in former layers. After convolutional layer, a non-linear activation function, such as sigmoid unit, is applied to improve generalization of learned feature maps. The pooling layers perform downsampling operation on local regions of feature maps to reduce the dimension of input feature maps and provide translation invariance at the same time. The fully-connected layers finally follow several stacked convolutional and pooling layers, and the last fully-connected layer is a Softmax layer that computes the scores for each defined class. The parameters of CNNs are typically trained with classic stochastic gradient descent based on the backpropagation algorithm. With well trained parameters, CNNs transform the input images to high-level feature maps in a feedforward manner.
Based on the typical deep CNN, AlexNet replaces the sigmoid unit with the rectified linear unit (ReLU), which allows much faster training. On the other hand, it uses dropout technique to alleviate the effect of over-fitting [15]. Moreover, CaffeNet further places the non-linear activation functions after pooling layers [17]. Very recently, there are two major directions, in which a lot of efforts are made to update the typical deep CNN, and drive it to achieve better performance in computer vision tasks.
The first direction is to make CNNs deeper. The VGG-VD networks developed by Simonyan et al. [18] are very deep CNN models, which were the runner-up in ILSVRC-2014. As two successful very deep CNN models, VGG-VD16 and VGG-VD19 demonstrate that the depth of the network plays a significant role in improving classification accuracy. Furthermore, MSRA-Net is made deeper by replacing each 5 × 5 filter with two 3 × 3 filters in series [34]. It achieves better performance and reduces computational complexity at the same time.
The other direction is to renovate the typical layers in deep CNNs. Network in Network (NIN) [35] replaces the linear convolutional layer with a multilayer perceptron, called the MLPconv layer. In addition, instead of fully-connected layers, it uses global average pooling to obtain output features. GoogLeNet, the CNN architecture that won the ILSVRC-2014 competition, contains 24 layers [28]. Inspired by the "Network in Network" idea, it uses the inception modules shown in Figure 1, which employ filters of different sizes at each layer while reducing the number of parameters. Furthermore, the inception module is modified in the CNN architecture of Inception V3 [36], in which two 3 × 3 filters in series replace each 5 × 5 filter. Moreover, 1 × n and n × 1 convolutional kernels derived from the n × n operation reduce the number of parameters in the network and the computational cost. Figure 2 depicts the changes to the inception module in the architecture of Inception V3. Derived from the architecture of Inception V3, the Inception V4 network benefits from a new inception module, which is more complex and deeper [37].
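The parameter savings mentioned above can be checked with simple arithmetic. The following is a minimal sketch, assuming single-channel inputs and outputs and ignoring biases; the helper `conv_weights` and the choice n = 7 are illustrative assumptions, not taken from the cited architectures.

```python
def conv_weights(kernel_h, kernel_w, channels_in=1, channels_out=1):
    """Number of weights (ignoring biases) in one convolutional layer."""
    return kernel_h * kernel_w * channels_in * channels_out

# One 5x5 layer versus two 3x3 layers in series (same 5x5 receptive field).
single_5x5 = conv_weights(5, 5)
stacked_3x3 = 2 * conv_weights(3, 3)

# One n x n layer versus a 1 x n layer followed by an n x 1 layer.
n = 7  # illustrative value only
full_nxn = conv_weights(n, n)
factorized = conv_weights(1, n) + conv_weights(n, 1)

print(single_5x5, stacked_3x3)  # 25 18
print(full_nxn, factorized)     # 49 14
```

With equal channel counts on both sides, the same ratios carry over, since the channel factors multiply every term.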
By integrating the two directions discussed above, the deep residual network (ResNet) [29], which won 1st place in ILSVRC-2015, reformulates its layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions such as the convolutional operation. At the same time, it is easy to optimize, and can gain accuracy from increased depth. ResNet achieves great success on the ImageNet dataset with a depth of up to 152 layers, 8× deeper than VGG nets. Based on ResNet, the identity mapping residual net further optimizes the residual learning framework, and achieves better performance by a considerable margin [37].
In summary, Figure 3 briefly illustrates the evolution of the structure of CNNs. Although not strictly separated, the two channels in Figure 3 depict the two mainstream ideas by which the typical deep CNN has been updated to achieve successful performance. However, the successful deep CNNs discussed above do not perform as well as expected when we directly apply them to remote sensing images. In fact, almost all successful deep CNNs are trained on daily natural image datasets, such as ImageNet [23], because huge amounts of labeled daily images are available online. In the field of remote sensing, the limited training data in remote sensing datasets leads to overfitting when we attempt to train a deep CNN, and a deep CNN trained on limited training data does not generalize well to test data.
An effective solution, recently explored in [22,25,26], is to transfer deep features trained on the ImageNet dataset to remote sensing images. This solution rests on the observation that, in the lower layers of a deep CNN, the features learned from daily natural images and from remote sensing images are alike, such as blobs and edges. These features are general enough to be useful for both kinds of datasets, and thus the high-level features in deep CNNs computed from daily natural images may be powerful representations for remote sensing images. However, this transfer depends on an important principle: the "distance" between the source dataset on which the deep CNN is trained and the target dataset to which the deep features are transferred should be small enough. In this paper, we define "distance" as the degree of difference in spatial and spectral information between the source and target datasets. In order to reduce the "distance" between remote sensing images and daily natural images, we design LPCANet and LQPCANet to synthesize the spatial and spectral information of remote sensing images, respectively. By doing this, the generalization power of CNNs pre-trained on ImageNet is enhanced for remote scene classification.
Remote Sens. 2017, 9, 225

LPCANet and Its Quaternion Representation
In this section, we design the structure of LPCANet, which derives from the PCANet [27]. We try to synthesize the spatial information of remote sensing images through it. In addition, we introduce quaternion algebra into LPCANet to further synthesize the spectral information of remote sensing images. The hashing and histogram stage of PCANet is replaced by weighting and pooling stages in LPCANet and LQPCANet to guarantee linearity throughout all of their operations. By doing this, the principal features of remote sensing images are learned and then sent to deep CNNs, which are pre-trained on large-scale datasets such as ImageNet [23]. The structure of LPCANet (LQPCANet) is depicted in Figure 4, within which the quaternion PCA filters are shown in broken lines. Suppose that we have $N$ input remote sensing images $\{I_i\}_{i=1}^{N}$ of size $m \times n \times 3$ and corresponding labels for training. Then, the input images $\{I_i \in \mathbb{R}^{m \times n \times 3}\}_{i=1}^{N}$ can be concatenated as follows:

$$I = [I_1, I_2, \cdots, I_N] \in \mathbb{R}^{m \times Nn \times 3}$$

In the following, we describe the structure of LPCANet in detail.
A. Learning PCA Filters Bank from Remote Sensing Images.
Assume that the patch size (or two-dimensional filter size) is $k_1 \times k_2$, where $k_1$ and $k_2$ are odd integers satisfying $1 \le k_1 \le m$ and $1 \le k_2 \le n$. With a zero-padded boundary, we slide a patch of size $k_1 \times k_2$ over each pixel of the $i$th remote sensing image $I_i \in \mathbb{R}^{m \times n \times 3}$ in each spectral channel separately, and collect all overlapping patches of the $i$th image in each spectral channel. Then, we subtract the patch mean from each patch and reshape each $k_1 \times k_2$ matrix into a column vector. Concatenating these vectors gives the matrix $P_i^j = [p_{i,1}^j, p_{i,2}^j, \cdots, p_{i,mn}^j] \in \mathbb{R}^{k_1 k_2 \times mn}$, where $j = 1, 2, 3$ denotes the spectral channel. Repeating the above process, we construct the same matrix for all input images. Putting them together, we obtain

$$P^j = [P_1^j, P_2^j, \cdots, P_N^j] \in \mathbb{R}^{k_1 k_2 \times Nmn}$$

Assuming that the number of PCA filters is $L$, the PCA algorithm minimizes the reconstruction error of $P^j$ in Frobenius norm as follows:

$$\min_{U^j \in \mathbb{R}^{k_1 k_2 \times L}} \left\| P^j - U^j (U^j)^T P^j \right\|_F^2, \quad \text{s.t.} \; (U^j)^T U^j = I_L \qquad (3)$$

where $I_L$ is an identity matrix of size $L \times L$.
By eigenvalue decomposition, the solution of Equation (3) is given by the $L$ leading principal eigenvectors of $P^j (P^j)^T$, which are arranged in decreasing order of magnitude and can be written as

$$U^j = [u_1^j, u_2^j, \cdots, u_L^j]$$

Therefore, the PCA filters learned from each spectral channel of remote sensing images can be obtained by

$$V_l^j = \mathrm{mat}_{k_1,k_2}(u_l^j) \in \mathbb{R}^{k_1 \times k_2}, \quad l = 1, 2, \cdots, L$$

where $\mathrm{mat}_{k_1,k_2}(u_l^j)$ is a function that maps $u_l^j \in \mathbb{R}^{k_1 k_2}$ to a matrix $V_l^j \in \mathbb{R}^{k_1 \times k_2}$. This filter bank captures the main variation of all of the mean-removed training patches. In Section 2.2.2, we will use the learned filter bank to extract the feature maps from each spectral channel of remote sensing images by convolutional operation.
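The filter-learning step described above can be sketched in a few lines of NumPy. This is an illustrative sketch for a single spectral channel, not the authors' implementation; the function names `extract_patches` and `learn_pca_filters`, the random toy images, and the sizes used are all assumptions made for the example.

```python
import numpy as np

def extract_patches(channel, k1, k2):
    """Collect every k1 x k2 patch (zero-padded border) as a mean-removed column."""
    m, n = channel.shape
    padded = np.pad(channel, ((k1 // 2, k1 // 2), (k2 // 2, k2 // 2)))
    cols = np.empty((k1 * k2, m * n))
    idx = 0
    for x in range(m):
        for y in range(n):
            patch = padded[x:x + k1, y:y + k2].reshape(-1)
            cols[:, idx] = patch - patch.mean()
            idx += 1
    return cols

def learn_pca_filters(channels, k1, k2, L):
    """Return L filters of shape (k1, k2) learned from a list of 2-D arrays."""
    P = np.hstack([extract_patches(c, k1, k2) for c in channels])
    eigvals, eigvecs = np.linalg.eigh(P @ P.T)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:L]        # indices of the L largest
    return [eigvecs[:, i].reshape(k1, k2) for i in order]

# Toy data: one spectral channel of four random 16 x 16 "images".
rng = np.random.default_rng(0)
images = [rng.random((16, 16)) for _ in range(4)]
filters = learn_pca_filters(images, k1=5, k2=5, L=3)
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the learned filters are mutually orthogonal with unit norm, which matches the constraint in Equation (3).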
B. Learning QPCA Filters Bank from Remote Sensing Images.
By applying quaternion algebra to the input remote sensing images $\{I_i \in \mathbb{R}^{m \times n \times 3}\}_{i=1}^{N}$, we obtain their representation in the quaternion domain. The $i$th remote sensing image $I_i \in \mathbb{R}^{m \times n \times 3}$ can be represented as follows:

$$\dot{I}_i(x, y) = R_i(x, y) + C_{1,i}(x, y)\,\mathbf{i} + C_{2,i}(x, y)\,\mathbf{j} + C_{3,i}(x, y)\,\mathbf{k}$$

where $R_i(x, y)$, $C_{1,i}(x, y)$, $C_{2,i}(x, y)$ and $C_{3,i}(x, y)$ are real values of the pixel at position $(x, y)$, with $1 \le x \le m$ and $1 \le y \le n$. $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ are three imaginary units, which represent the spectral channels and obey the following rules:

$$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1, \quad \mathbf{i}\mathbf{j} = -\mathbf{j}\mathbf{i} = \mathbf{k}, \quad \mathbf{j}\mathbf{k} = -\mathbf{k}\mathbf{j} = \mathbf{i}, \quad \mathbf{k}\mathbf{i} = -\mathbf{i}\mathbf{k} = \mathbf{j}$$

Furthermore, we set $R_i(x, y) \equiv 0$. Then, the $i$th remote sensing image can be further represented as a pure quaternion:

$$\dot{I}_i(x, y) = C_{1,i}(x, y)\,\mathbf{i} + C_{2,i}(x, y)\,\mathbf{j} + C_{3,i}(x, y)\,\mathbf{k}$$

In addition, we set the patch size as $k_1 \times k_2$, and collect all the quaternion patches around each pixel of the $i$th remote sensing image. Then we subtract the patch mean from each quaternion patch and reshape each $k_1 \times k_2$ matrix into a column vector, which is a hypercomplex vector belonging to $\mathbb{H}^{k_1 k_2}$, where $\mathbb{H}$ denotes the skew field of quaternions. We then concatenate these column vectors to obtain the matrix $Q_i = [q_{i,1}, q_{i,2}, \cdots, q_{i,mn}] \in \mathbb{H}^{k_1 k_2 \times mn}$. Thus, for all input remote sensing images, we obtain:

$$Q = [Q_1, Q_2, \cdots, Q_N] \in \mathbb{H}^{k_1 k_2 \times Nmn}$$

Assume that the number of QPCA filters is $L$. We can obtain the $L$ leading principal eigenvectors of $QQ^T$ by applying quaternion eigenvalue decomposition to the covariance matrix of $Q$.
The $L$ leading principal eigenvectors, written here as $\dot{u}_l$, can then be mapped to the $L$ QPCA filters:

$$\dot{V}_l = \mathrm{mat}_{k_1,k_2}(\dot{u}_l) \in \mathbb{H}^{k_1 \times k_2}, \quad l = 1, 2, \cdots, L$$

By using the QPCA filter bank to convolve the remote sensing images, we synthesize not only their spatial information, but also their spectral information. Moreover, non-commutativity under multiplication is an important characteristic of quaternion algebra. After the QPCA operation, the relative relationship of the spectral channels is enhanced, and the distinct meaning of each spectral channel is weakened.
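The multiplication rules of the imaginary units, and the non-commutativity they imply, can be illustrated with the standard Hamilton product. The sketch below represents a quaternion as a plain `(real, i, j, k)` tuple; this representation and the function name `hamilton_product` are choices made for the example only.

```python
def hamilton_product(p, q):
    """Multiply two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
k = (0, 0, 0, 1)

print(hamilton_product(i, j))  # (0, 0, 0, 1), i.e. k
print(hamilton_product(j, i))  # (0, 0, 0, -1), i.e. -k
```

A pixel with channel values (c1, c2, c3) corresponds to the pure quaternion `(0, c1, c2, c3)` in this representation, so products of pixels and filter coefficients mix the three spectral channels.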

Encoding Feature Maps by Convolutional Operation
By convolving the learned PCA and QPCA filter banks with the $i$th input remote sensing image, respectively, we obtain the feature maps as denoted in Equations (10) and (11):

$$I_{i,l}^1 = I_i * V_l, \quad l = 1, 2, \cdots, L \qquad (10)$$

$$\dot{I}_{i,l}^1 = \dot{I}_i * \dot{V}_l, \quad l = 1, 2, \cdots, L \qquad (11)$$

where $*$ denotes two-dimensional (2-D) convolution, the superscript 1 denotes the first layer of feature maps encoded by convolutional operation, $\dot{I}_i$ denotes the pure quaternion representation of $I_i$, and $V_l$ and $\dot{V}_l$ denote the $l$th PCA and QPCA filters. The boundary of $I_i$ is zero-padded before convolving with $V_l$ or $\dot{V}_l$ in order to ensure that $I_{i,l}^1$ and $I_i$ have the same size. Therefore, after convolving with the PCA filter bank, the $i$th input remote sensing image $I_i$ is transformed into $L$ feature maps in each spectral channel as $I_i^1 = \{I_{i,l}^1\}_{l=1}^{L}$. On the other hand, after convolving with the QPCA filter bank, the $i$th input remote sensing image $I_i$ is transformed into $L$ quaternion feature maps $\dot{I}_i^1 = \{\dot{I}_{i,l}^1\}_{l=1}^{L}$. For the $N$ input remote sensing images $\{I_i\}_{i=1}^{N}$, we obtain the set of feature maps $\{I_i^1\}_{i=1}^{N}$ after the convolutional operation above. Then, the feature maps $\{I_i^1\}_{i=1}^{N}$ can be concatenated as follows:

$$I^1 = [I_{1,1}^1 \cdots I_{1,L}^1, I_{2,1}^1 \cdots I_{2,L}^1, \cdots, I_{N,1}^1 \cdots I_{N,L}^1] \in \mathbb{R}^{m \times NLn \times 3} \qquad (12)$$
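For the real-valued (PCA) case, the encoding step amounts to a zero-padded "same" convolution of each spectral channel with each filter. The following NumPy sketch illustrates this under the assumption of odd filter sizes; `conv2d_same` and `encode` are hypothetical helper names, and the delta filters in the usage lines are toy values chosen so the result is easy to check.

```python
import numpy as np

def conv2d_same(image, kernel):
    """2-D convolution with zero padding; output has the same size as the input."""
    k1, k2 = kernel.shape
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    padded = np.pad(image, ((k1 // 2, k1 // 2), (k2 // 2, k2 // 2)))
    rows, cols = image.shape
    out = np.zeros((rows, cols))
    for dx in range(k1):
        for dy in range(k2):
            out += flipped[dx, dy] * padded[dx:dx + rows, dy:dy + cols]
    return out

def encode(image, filters_per_channel):
    """Convolve each spectral channel with its own filter bank.

    image: (m, n, 3) array; filters_per_channel: 3 lists of L (k1, k2) arrays.
    Returns an (L, m, n, 3) array of first-layer feature maps.
    """
    L = len(filters_per_channel[0])
    maps = np.zeros((L,) + image.shape)
    for j in range(image.shape[2]):
        for l in range(L):
            maps[l, :, :, j] = conv2d_same(image[:, :, j], filters_per_channel[j][l])
    return maps

# Toy check: a centered delta filter reproduces the input channel.
image = np.arange(48, dtype=float).reshape(4, 4, 3)
delta = np.zeros((3, 3))
delta[1, 1] = 1.0
maps = encode(image, [[delta, 2 * delta]] * 3)
```

Each of the L output maps keeps the m × n × 3 shape of the input, which is what allows the maps to be concatenated as in Equation (12).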

Feature Maps Weighting and Pooling
We weight the feature maps encoded by the convolutional operation according to the order of importance in which the principal features are arranged. Then, we pool the weighted feature maps to further enhance the shift invariance of the features.
The weighting process can be written as follows:

$$T_i^1 = \sum_{l=1}^{L} 2^{L-l}\, I_{i,l}^1$$

where $L$ is the number of PCA or QPCA filters in Section 3.1, and the superscript 1 denotes the first weighting layer. The smaller the value of $l$, the more important the feature map $I_{i,l}^1$, and the larger the weight $2^{L-l}$ attached to it. After the weighting operation, the features in the first weighting layer can be denoted as

$$T^1 = [T_1^1, T_2^1, \cdots, T_N^1] \in \mathbb{R}^{m \times Nn \times 3}$$

Assume that the size of the pooled feature map is $m' \times n'$, and divide the $i$th "image" $T_i^1$ into $m'n'$ blocks. Let $R_i = \{R_{i,1,1}, \cdots, R_{i,x',y'}, \cdots, R_{i,m',n'}\}$ be the partition of "image" $T_i^1$, where $x'$ and $y'$ denote the location of the corresponding pooling region, with $1 \le x' \le m'$ and $1 \le y' \le n'$. We perform mean pooling in each block as follows:

$$r_{i,x',y'} = \frac{1}{|R_{i,x',y'}|} \sum_{s_i \in R_{i,x',y'}} s_i$$

where $r_{i,x',y'}$ denotes the pooled features at location $(x', y')$, and $s_i$ denotes the features of "image" $T_i^1$ within pooling region $R_{i,x',y'}$. The pooled filter responses $r_i = \{r_{i,1,1}, \cdots, r_{i,x',y'}, \cdots, r_{i,m',n'}\}$ generated from pooling regions $R_i = \{R_{i,1,1}, \cdots, R_{i,x',y'}, \cdots, R_{i,m',n'}\}$ reduce the variance of the non-pooled representation.
Finally, the pooled features can be seen as input images of the pre-trained deep CNNs.
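The weighting and pooling steps above can be sketched as follows. This is an illustrative NumPy sketch that assumes the pooling regions form a regular grid of equal, non-overlapping blocks (rows and columns that do not fit are trimmed); the text does not state how the partition is chosen, and `weight_maps` and `mean_pool` are hypothetical names.

```python
import numpy as np

def weight_maps(feature_maps):
    """Combine L feature maps with weights 2**(L - l), l = 1..L."""
    L = feature_maps.shape[0]
    weights = 2.0 ** (L - np.arange(1, L + 1))
    return np.tensordot(weights, feature_maps, axes=1)

def mean_pool(image, out_h, out_w):
    """Average over a regular grid of out_h x out_w non-overlapping blocks."""
    m, n = image.shape[:2]
    bh, bw = m // out_h, n // out_w
    trimmed = image[:bh * out_h, :bw * out_w]
    blocks = trimmed.reshape(out_h, bh, out_w, bw, *image.shape[2:])
    return blocks.mean(axis=(1, 3))

# Toy check: three constant maps of ones are weighted by 4, 2 and 1.
maps = np.ones((3, 4, 4, 3))
T = weight_maps(maps)        # every entry equals 4 + 2 + 1 = 7
pooled = mean_pool(T, 2, 2)  # shape (2, 2, 3)
```

Both operations are weighted sums with fixed coefficients, so they keep the pipeline linear in the input image.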

Multi-Stage Architecture
If a deeper architecture is found to be beneficial for the specific task, we can stack the above process to build a multi-stage architecture of the LPCANet or LQPCANet. As depicted in Figure 4, the two-stage LPCANet or two-stage LQPCANet contains two convolution layers (C1 and C2), two weighting layers (W1 and W2) and a pooling layer. The output of the last layer is fed to pre-trained deep CNNs to obtain semantic features for classification.
In Figure 4, the PCA filters bank $V^1$ and the QPCA filters bank $\dot{V}^1$ (the dot marks the quaternion-valued bank), both of which contain $L_1$ filters, are obtained from $I$. In layer C1, $V^1$ or $\dot{V}^1$ is convolved with $I$ to get the sets of feature maps $I^1$. These feature maps are then weighted to obtain $T^1$ in layer W1, which decreases the number of feature maps at the same time. The filters banks $V^2$ and $\dot{V}^2$, both of which contain $L_2$ filters, are generated from $T^1$. Then, layer C2 executes the convolutional operation, which uses kernel $V^2$ or $\dot{V}^2$ to get the sets of feature maps $I^2$. $I^2$ is further weighted as described in Section 3.3 to obtain $T^2$ in layer W2. Finally, we pool the feature maps $T^2$ to obtain the final feature maps $r$, which serve as the input "images" of pre-trained deep CNNs. One or more additional stages can be stacked like C1-W1-C2-W2-C3 . . . , which can also be depicted in terms of feature maps as $I - I^1 - T^1 - I^2 - T^2 - \cdots - r$. Note that the whole process in the multi-stage architecture of LPCANet or LQPCANet is linear. That is to say, we do not change the basic structure of the original images when we synthesize their spatial and spectral information.
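The linearity of the two-stage architecture can be made concrete with a short derivation for one spectral channel. This is a sketch under two assumptions that go beyond the text: the filters are held fixed once they have been learned, and boundary effects caused by zero-padding are ignored. The effective kernel $K$ and the operator name $\operatorname{pool}$ are introduced here for illustration only.

```latex
\begin{aligned}
T^1_i &= \sum_{l=1}^{L_1} 2^{L_1-l}\,\bigl(I_i * V^1_l\bigr), \\
T^2_i &= \sum_{l'=1}^{L_2} 2^{L_2-l'}\,\bigl(T^1_i * V^2_{l'}\bigr)
       = I_i * K, \qquad
K = \sum_{l=1}^{L_1}\sum_{l'=1}^{L_2} 2^{L_1-l}\,2^{L_2-l'}\,\bigl(V^1_l * V^2_{l'}\bigr), \\
r_i &= \operatorname{pool}\bigl(T^2_i\bigr)
\quad\Longrightarrow\quad
r(\alpha I_i + \beta I_{i'}) = \alpha\, r(I_i) + \beta\, r(I_{i'}).
\end{aligned}
```

Under these assumptions, the two stages act on each channel like a single convolution followed by mean pooling, which is one way to read the statement that the basic structure of the original images is not changed.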

Methodology of Enhancing the Generalization Power of Pre-Trained Deep CNNs for Remote Scene Classification
The difference between remote sensing images and daily natural images lies mainly in the following two aspects. Firstly, they usually differ in spatial information. As shown in Figure 5, both images depict an airport and contain airplanes, runways and lawns. Nevertheless, their spatial information is very different in scale and direction. Moreover, compared with the daily optical image, the remote sensing image contains more noise, which hampers the scene classification task. Secondly, they may differ in spectral information. Although the two images in Figure 6 both depict farmland and are almost the same in spatial arrangement, the spectral channels of the left image are red-green-blue, whereas the spectral channels of the right image are green-red-infrared. As mentioned previously, to extract general features for CNNs pre-trained on the ImageNet dataset, we should reduce the "distance" between daily natural images and remote sensing images. In this section, as illustrated in Figure 7, we propose two architectures to enhance the generalization power of pre-trained deep CNNs for remote scene classification. In the Experimental Section, we further evaluate their effectiveness.
Figure 7 (caption fragment). In (b) Architecture (II), after representing the remote sensing imagery as pure quaternions, we use the linear quaternion PCA network to further synthesize its spectral information.

Architecture (I): Synthesizing Spatial Information of Remote Sensing Images to Extract General Features for Pre-Trained Deep CNNs
In Architecture (I), firstly, after dividing the remote sensing image into a series of spectral channels, we apply LPCANet to the "gray" image in each spectral channel. As mentioned in Section 2.2, for the remote scene classification task, LPCANet filters out irrelevant details and noise in the remote sensing image while preserving its main structure. Secondly, the output images of the LPCANets for all the spectral channels are rearranged into a synthesized image, which is the input of the pre-trained deep CNN. By synthesizing the spatial information of the remote sensing image, the "distance" between the daily natural image and the remote sensing image is reduced. Then, the pre-trained deep CNN is treated as a fixed feature extractor. In a feedforward way, it extracts a global feature representation of the synthesized image. Finally, with the global representation, we perform remote scene classification with a linear SVM classifier. In addition, we should consider the following practical details:
1. Thus far, almost all successful pre-trained deep CNNs are based on the ImageNet dataset. This imposes the constraint that the input images must have exactly three spectral channels when we use the pre-trained deep CNNs to extract a global representation from them. This constraint limits the application range of pre-trained deep CNNs and causes inevitable information loss when the input images have more than three spectral channels.

2. Data augmentation is a practical technique to improve the performance of deep CNNs by reducing overfitting in the training stage. However, in this paper, we use the pre-trained deep CNNs in a feedforward way without training on the remote sensing dataset, because training a deep CNN on a small dataset helps little. Moreover, in some cases we cannot obtain the labels of remote sensing images. Unlike data augmentation, which enhances the generalization power of deep CNNs in a supervised framework, LPCANet synthesizes the spatial information of remote sensing images and enhances the generalization power of pre-trained deep CNNs in an unsupervised manner.

3. We prefer to apply Architecture (I) to optical remote sensing images rather than to other remote sensing images such as SAR images, because the spectral channels of daily natural images in the ImageNet dataset and of optical remote sensing images in the target dataset are both red-green-blue, so the "distance" between them is relatively small.
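The per-channel processing and rearrangement in Architecture (I) can be sketched as follows. This is only an outline under stated assumptions: `stand_in_lpcanet` is a hypothetical placeholder (plain mean pooling) for the real LPCANet, and the input is a random toy image.

```python
import numpy as np


def stand_in_lpcanet(channel, block=8):
    """Hypothetical placeholder for LPCANet: non-overlapping mean pooling."""
    m, n = channel.shape[0] // block, channel.shape[1] // block
    cropped = channel[:m * block, :n * block]
    return cropped.reshape(m, block, n, block).mean(axis=(1, 3))


def synthesize_per_channel(image, channel_fn):
    """Apply the same single-channel transform to every spectral channel
    and stack the results back into one multi-channel image."""
    channels = [channel_fn(image[:, :, c]) for c in range(image.shape[2])]
    return np.stack(channels, axis=2)


rng = np.random.default_rng(0)
rgb = rng.random((64, 64, 3))                  # toy three-channel image
synthesized = synthesize_per_channel(rgb, stand_in_lpcanet)
```

Because each channel is processed independently, the channel order and the spectral content of the image are left unchanged, which matches the remark that Architecture (I) acts only on spatial information.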

Architecture (II): Further Synthesizing Spectral Information of Remote Sensing Images to Extract General Features for Pre-Trained Deep CNNs
As discussed above, when the spectral information of remote sensing images differs from that of images in the ImageNet dataset, the "distance" in spectral information between them is relatively large, and the performance of remote scene classification degrades noticeably when we directly transfer pre-trained deep CNNs to remote scene classification. LPCANet in Architecture (I) can only synthesize the spatial information of remote sensing images in each spectral channel. It cannot handle the difference in spectral information between the source and target datasets. Therefore, inspired by quaternion algebra and the relationship of elements in the quaternion representation, we represent remote sensing images in the quaternion domain and design the LQPCANet to synthesize their spectral information. Derived from LPCANet, LQPCANet in Architecture (II) further reduces the "distance" between the source dataset and the target dataset, and enhances the generalization power of pre-trained deep CNNs for remote scene classification. Firstly, remote sensing images are represented in the form of pure quaternions. Secondly, they are pre-processed by LQPCANet. Then, the synthesized images are fed into the pre-trained deep CNN to obtain a global feature representation, which is finally used to perform remote scene classification with a linear SVM classifier. The practical details of Architecture (II) are as follows:

1. The constraint on the number of spectral channels discussed in Section 2.3.1 also applies to Architecture (II): whenever we apply pre-trained deep CNNs to extract a global representation, the number of spectral channels of the input images is fixed at three. Thus, in practice, the pure quaternion, which contains three imaginary units, is used to represent remote sensing images.

2. LQPCANet processes the pure quaternion representation of remote sensing images, rearranges the order of their spectral channels, and maintains only the relative relationship among them. Therefore, no particular spectral channel needs to be assigned to a particular imaginary unit when we transform the remote sensing images into pure quaternion form.
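The pure quaternion representation mentioned above can be illustrated with a short sketch. It is not the LQPCANet itself: it only shows how a three-channel image maps to quaternions with zero real part, together with the standard Hamilton product. The storage layout [real, i, j, k] and the function names are assumptions made for this example.

```python
import numpy as np


def to_pure_quaternion(image):
    """Map an (H, W, 3) image to an (H, W, 4) array [real, i, j, k]
    whose real part is zero (a pure quaternion per pixel)."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an image with exactly three channels")
    q = np.zeros(image.shape[:2] + (4,), dtype=float)
    q[..., 1:] = image          # three channels -> three imaginary units
    return q


def hamilton_product(p, q):
    """Element-wise Hamilton product of quaternion arrays (last axis = 4)."""
    a1, b1, c1, d1 = np.moveaxis(np.asarray(p, dtype=float), -1, 0)
    a2, b2, c2, d2 = np.moveaxis(np.asarray(q, dtype=float), -1, 0)
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,   # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,   # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,   # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,   # k part
    ], axis=-1)


rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))              # toy three-band image
quat = to_pure_quaternion(image)
squared = hamilton_product(quat, quat)     # real part = -(sum of squares)
```

The product couples the three bands of a pixel instead of treating them separately: the square of a pure quaternion has a real part equal to minus the sum of the squared band values and no imaginary part.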

Experiments and Results
The main objective of this paper is to evaluate the two proposed architectures in enhancing the generalization power of deep pre-trained CNNs for remote scene classification. Therefore, we organize the experiments for Architecture (I) and Architecture (II), respectively, with various deep pre-trained CNNs and various remote sensing datasets.

Experimental Setup
In this section, we carry out a number of experiments based on Architecture (I) and Architecture (II), respectively. To evaluate their effectiveness in enhancing the generalization power of deep pre-trained CNNs for remote scene classification, we conduct experiments on three remote sensing datasets that differ in spatial and spectral information. We compare the performance of our proposed framework with the state-of-the-art results on these three datasets. Note that, except for learning the classifier, all the experiments based on Architecture (I) and Architecture (II) are unsupervised.
The three publicly available datasets used in our experiments are as follows:
1. UC Merced Land Use Dataset (http://vision.ucmerced.edu/datasets/landuse.html). Derived from the United States Geological Survey (USGS) National Map, this dataset contains 2100 aerial scene images with 256 × 256 pixels, which are manually labeled as 21 land use classes, 100 for each class. Figure 8 shows one example image for each class. As shown in Figure 8, this dataset presents very small inter-class diversity among some categories, such as "dense residential", "medium residential" and "sparse residential". More examples and more information are available in [38].
2. WHU-RS Dataset (http://www.tsi.enst.fr/~xia/satellite_image_project.html). Collected from Google Earth, this dataset is composed of 950 aerial scene images with 600 × 600 pixels, which are uniformly distributed in 19 scene classes, 50 for each class. The example images for each class are shown in Figure 9. Images in both this dataset and the UC Merced dataset are optical images (RGB color space), so they are the same in spectral information. However, compared with the images in the UC Merced dataset, images in this dataset contain more spatial detail. The wide variation in scale and resolution of objects within the images makes this dataset more complicated than the UC Merced dataset.
3. Brazilian Coffee Scenes Dataset (www.patreo.dcc.ufmg.br/downloads/brazilian-coffee-dataset/). Taken by the SPOT sensor in the green, red, and near-infrared bands over four counties in the State of Minas Gerais, Brazil, this dataset was released in 2015 and includes over 50,000 remote sensing images with 64 × 64 pixels, which are labeled as coffee (1438), non-coffee (36577) or mixed (12989) [25]. Figure 10 shows three example images for each of the coffee and non-coffee classes in false colors. To provide a balanced dataset for the experiments, 1438 images of each of the coffee and non-coffee classes are picked out, while all images of the mixed class are discarded. Note that this dataset is very different from the former two datasets: its images are not optical (green-red-infrared instead of red-green-blue).
Following the experimental protocol of recent studies [22,25,26], we run our experiments with five-fold cross-validation. For the UC Merced dataset, each of the five folds contains 420 images. For the WHU-RS dataset, each of the five folds has 190 images. For the Brazilian Coffee Scenes dataset, four folds have 600 images each and the fifth has 476 images. We report results in terms of average accuracy and standard deviation over the five folds. We use five well-known pre-trained deep CNNs (AlexNet [15], CaffeNet [17], VGG-VD16 [18], GoogLeNet [28], and ResNet [29]), described in Section 2.1, to test the effectiveness of our proposed Architecture (I) and Architecture (II) in the experiments. As analyzed before, the operations in both LPCANet (as well as LQPCANet) and the pre-trained deep CNN are unsupervised, so all the experiments are unsupervised except for learning the classifier.
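The fold sizes quoted above follow directly from the dataset sizes, as the short check below shows. The accuracy values at the end are hypothetical placeholders, not results from this paper, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev

N_FOLDS = 5

# UC Merced: 2100 images, WHU-RS: 950 images, both split into equal folds.
uc_merced_folds = [2100 // N_FOLDS] * N_FOLDS
whu_rs_folds = [950 // N_FOLDS] * N_FOLDS

# Brazilian Coffee Scenes: 1438 coffee + 1438 non-coffee images,
# four folds of 600 images and one fold holding the remainder.
balanced_coffee = 2 * 1438
coffee_folds = [600] * 4 + [balanced_coffee - 4 * 600]


def summarize(fold_accuracies):
    """Average accuracy and sample standard deviation over the folds."""
    return mean(fold_accuracies), stdev(fold_accuracies)


hypothetical_accuracies = [0.90, 0.92, 0.91, 0.93, 0.89]  # placeholders only
average, deviation = summarize(hypothetical_accuracies)
```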

Experimental Results of Architecture (I)
We evaluate how well Architecture (I) enhances the generalization power of the five well-known pre-trained deep CNNs for remote scene classification. In Architecture (I), we consider a shallow LPCANet with only one stage. For the LPCANet, we set the PCA filter size to $k_1 = k_2 = 8$, the number of filters to L = 8, and the pooling range to 8 × 8 without overlapping for local features. The PCA filter banks require that $k_1 k_2 \geq L$. Note that a larger pooling range provides greater translation invariance in the extracted features r. Then, we use the Matlab function "imresize" with nearest-neighbor interpolation to resize the pooled feature maps r to 227 × 227 for AlexNet and CaffeNet, and to 224 × 224 for VGG-VD16, GoogLeNet and ResNet. Finally, we use a linear SVM as the classifier, and run experiments on the three remote sensing datasets introduced above. These datasets differ in spatial and spectral information, which lets us test the effectiveness of Architecture (I) under different conditions. Remote sensing images in the UC Merced and WHU-RS datasets are both optical. Thus, they have the same spectral information as the images in the ImageNet dataset used to pre-train these deep CNNs. Architecture (I) is mainly designed for this case, and we carry out most experiments for it. On the other hand, remote sensing images in the Brazilian Coffee Scenes dataset are not optical (green-red-infrared). In this case, the spectral information of the source and target datasets is different. We briefly introduce the experiment results of Architecture (I) on this dataset. Architecture (II) is mainly designed for this case, and we will discuss it in Section 3.3 in detail.
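The resizing step can be approximated in NumPy as sketched below. This is a stand-in for the Matlab "imresize" call, not a bit-exact reproduction of it. The function name `resize_nearest`, the 32 × 32 size of the toy pooled map, and the pixel-centre index mapping are assumptions made for illustration.

```python
import numpy as np

# Filter settings reported above; a k1 x k2 patch yields at most k1*k2
# principal components, hence the requirement k1 * k2 >= L.
k1, k2, L = 8, 8, 8
assert k1 * k2 >= L


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = image.shape[:2]
    # Map the centre of every output pixel back to a source index.
    rows = ((np.arange(out_h) + 0.5) * in_h / out_h).astype(int)
    cols = ((np.arange(out_w) + 0.5) * in_w / out_w).astype(int)
    rows = np.minimum(rows, in_h - 1)
    cols = np.minimum(cols, in_w - 1)
    return image[rows][:, cols]


rng = np.random.default_rng(0)
pooled = rng.random((32, 32, 3))                 # toy pooled feature maps
for_alexnet = resize_nearest(pooled, 227, 227)   # AlexNet / CaffeNet input
for_vgg = resize_nearest(pooled, 224, 224)       # VGG-VD16 / GoogLeNet / ResNet
small = resize_nearest(np.array([[1, 2], [3, 4]]), 4, 4)
```

Nearest-neighbor interpolation only copies existing values, so the resized input contains no intensities that were absent from the pooled feature maps.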
With various pre-trained deep CNN models and remote sensing datasets, the remote scene classification performances are shown in Table 1. In Table 1, Ac and SD denote accuracy and standard deviation, respectively. For better comparison, we further show the accuracy of remote scene classification on the UC Merced and WHU-RS datasets in Figure 11.
Table 1. Remote scene classification results of five well-known pre-trained deep CNNs on three different remote sensing datasets.
In the Off-the-shelf condition, pre-trained deep CNNs are directly used as feature extractors in an unsupervised manner. After the last fully-connected layer is removed, the remaining parts of the pre-trained deep CNNs extract high-dimensional feature vectors from remote sensing images. These feature vectors are taken as the final image representation and are followed by a linear SVM classifier. In fact, this framework achieves nearly the best performance to date on optical remote sensing datasets [26]. Compared with training deep CNNs on remote sensing images from scratch, transferring pre-trained deep CNNs to remote scene classification shows obvious advantages [22], because the limited training data of remote sensing datasets causes serious overfitting, and training from scratch cannot make full use of the deep architecture.

However, in Table 1 and Figure 11, we can see that the performances of AlexNet, CaffeNet, VGG-VD16 and GoogLeNet are almost the same. There is an obvious bottleneck in directly transferring pre-trained deep CNNs to optical remote scene classification. Moreover, the experiment results overturn our intuition that CNNs with deeper structures or more sophisticated units perform better. In fact, GoogLeNet has no obvious advantage over AlexNet and CaffeNet, and VGG-VD16 even performs worse than AlexNet. The reason may be that the parameters in deeper layers are more specific to the dataset used to pre-train the deep CNNs (the ImageNet dataset in this paper), and these parameters lack generalization power. In addition, to our surprise, ResNets, the most successful deep CNNs to date, fail to obtain a good experiment result, whether they have 50, 101 or 152 layers. This phenomenon indicates that not all successful deep CNNs pre-trained on the ImageNet dataset are suitable for transferring to remote scene classification. In ResNets, shortcut connections bring fewer parameters and make the network much easier to optimize. At the same time, the direct connection between input and output leads to poor generalization ability when we transfer them to other tasks.
By extracting general features with LPCANet, Architecture (I) is proposed to obtain better performance when transferring pre-trained deep CNNs to remote scene classification. As we can see in Table 1 and Figure 11, the remote scene classification accuracy breaks the bottleneck and increases under Architecture (I). Taking a closer look at the experiment results, we find that, compared with Off-the-shelf, the margin gained by Architecture (I) becomes larger when we apply it to deeper or more sophisticated CNNs such as VGG-VD16 and GoogLeNet. This supports the conclusion that Architecture (I) can enhance the generalization power of pre-trained deep CNNs and make better use of them. In addition, the smaller standard deviation of classification accuracy under Architecture (I) suggests that Architecture (I) is more stable when transferring pre-trained deep CNNs to remote scene classification. Taking pre-trained CaffeNet as an example, Figure 12 shows in detail how an optical remote sensing image changes under Off-the-shelf and Architecture (I).
Reconstructions of the convolutional feature maps in the earlier network layers and of the fully connected layers, abbreviated as "conv" and "fc", are shown in Figure 12. Figure 12a shows that the representations of the convolutional layers are still photographically similar to the remote sensing image to some extent, although they become fuzzier and fuzzier from "conv1" to "conv5". In addition, the fully connected layers rearrange the information from lower layers to generate representations that are more abstract. They are composed of parts (e.g., the wings of airplanes) similar but not identical to the ones found in the original image. In Figure 12b, LPCANet filters out irrelevant details and noise in remote scenes while preserving their main structure. Based on PCA filters, the convolutional operation and the weighting operation retain the main discriminative ability of remote scenes. On the other hand, the pooling operation enhances the inter-class invariance. As a result, the synthesized image maintains the semantic features of remote scenes with less noise, and becomes less different from daily optical images in spatial information. Comparing the reconstructed images of the fully connected layers in Figure 12a,b, we find that there are more parts in various positions and scales in Figure 12b. Moreover, these parts, such as the wings of airplanes, are more discriminative and less blurred. This experiment result further confirms that Architecture (I) can enhance the generalization power of pre-trained deep CNNs and improve their performance on remote sensing images.
To intuitively reflect the distribution of the global features learned under Off-the-shelf and Architecture (I), we use the t-SNE algorithm [40,41] to visualize these high-dimensional global features by giving each datapoint a location in a 2-D map. For both conditions, the perplexity and the number of training iterations in the t-SNE algorithm are set to 30 and 1000, respectively. We show these 2-D embedding points with different colors corresponding to their actual scene categories. Figure 13 reveals the separability of the global features learned by pre-trained CaffeNet on the UC Merced dataset under the two conditions. Notably, the 2-D features from both conditions naturally tend to form clusters. In addition, compared with Off-the-shelf, Architecture (I) leads to better separability of global features.
Data augmentation is a practical technique for training an effective deep CNN. However, when we transfer a pre-trained deep CNN to remote scene classification, we treat the pre-trained deep CNN as a fixed feature extractor and do not change its parameters. Then, all the extracted features are used to train the classifier. Therefore, data augmentation affects only the classifier, and has no impact on the parameters of the pre-trained deep CNNs. For two typical classifiers, we test data augmentation in the framework of Architecture (I) on the UC Merced dataset by simply rotating the original remote sensing images by 90, 180 and 270 degrees. We find that data augmentation indeed works. However, it contributes little, as shown in Table 2.
To further verify the effectiveness of LPCANet in Architecture (I), as shown in Figure 14, we directly apply the PCA algorithm to every single image in the UC Merced dataset before the pre-trained deep CNN block. This simple architecture, called Architecture (S), is designed for comparison. Table 3 shows the experiment results on the UC Merced dataset without augmentation. We can see that the classification accuracy drops under Architecture (S) compared with both Architecture (I) and Off-the-shelf. This gives evidence that simply applying the PCA algorithm to remote sensing images may lose some discriminative spatial information, and cannot obtain general features for pre-trained deep CNNs. The experiment results further confirm the effectiveness of our proposed Architecture (I).
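The rotation-based augmentation described above can be sketched with NumPy as follows. The function name `rotation_augment` and the toy image are hypothetical, and the rotation direction (counter-clockwise, as in `np.rot90`) is an assumption, since the text does not specify it.

```python
import numpy as np


def rotation_augment(image, label):
    """Return the image and its 90, 180 and 270 degree rotations,
    each paired with the unchanged scene label."""
    return [(np.rot90(image, k, axes=(0, 1)), label) for k in range(4)]


rng = np.random.default_rng(0)
scene = rng.random((256, 256, 3))          # toy image of UC Merced size
augmented = rotation_augment(scene, label="airplane")
```

When such copies are combined with cross-validation, keeping all rotations of an image in the same fold as the original avoids evaluating on near-duplicates of training images.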
Various state-of-the-art methods have been proposed recently for remote scene classification, and most of them have been tested on the UC Merced dataset, following the same experimental protocol with five-fold cross-validation. Thus, in Table 4 we compare our best result achieved via Architecture (I) with these methods on the UC Merced dataset. With a straightforward and simple framework, our proposed Architecture (I) outperforms all the methods with a minimum gap of almost 1.5%. We must note that our proposed method provides only a basic framework to directly transfer pre-trained deep CNNs to remote scene classification in an unsupervised manner, and does not use the target dataset to train the parameters of the pre-trained CNNs. Therefore, our proposed method achieves no better result than the GoogLeNet + Fine-tune approach in [22]. The effectiveness of the fine-tuning approach depends heavily on the number of images in the remote sensing dataset, and its computation time is more demanding than that of our proposed Architecture (I). In fact, in Table 1, we can see that, with pre-trained CaffeNet in Architecture (I), the experiment result on the UC Merced dataset has almost reached the performance of the fine-tuning approach in [22]. In addition, if the remote scene classification task permits sufficient computation time and sufficient remote sensing images are available, we can further fine-tune the parameters of the pre-trained deep CNNs in Architecture (I).
In the Brazilian Coffee Scenes dataset, remote sensing images are not optical (green-red-infrared). In addition, as shown in Figure 10, the spatial information of these images is very simple. In Table 1, the relatively poor performance comes from the difference in spectral information when we transfer pre-trained deep CNNs to remote scene classification on this dataset. As analyzed before, LPCANet in Architecture (I) does not change the spectral information of remote sensing images. In addition, when the spatial information of remote sensing images is simple, LPCANet in Architecture (I) is less effective in decreasing the "distance" between the target dataset and the source dataset. For this dataset, the experiment results in Figure 15 indicate that Architecture (I) helps little and can even make things worse when the spectral information of remote sensing images is very different from that of the images in the source dataset.

Experimental Results of Architecture (II)
As discussed before, Architecture (I) obtains poor performance on the Brazilian Coffee Scenes dataset, in which the spectral information of the remote sensing images is very different from that of the images in the ImageNet dataset used to pre-train the deep CNNs. Therefore, we propose Architecture (II) to handle the difference in spectral information between source and target datasets, and to further enhance the generalization power of pre-trained deep CNNs for remote scene classification. With the same experimental parameters as in Section 3.2, we report remote scene classification results in Table 5 for Architecture (II), Architecture (I) and Off-the-shelf on the three remote sensing datasets.
In Table 5, Ar(II), Ar(I) and OTS denote Architecture (II), Architecture (I) and Off-the-shelf, respectively. From the experimental results, we find that Architecture (II) is superior to Architecture (I) and Off-the-shelf, with a substantial gain on the Brazilian Coffee Scenes dataset for all the pre-trained deep CNNs. On the other hand, Architecture (II) is slightly inferior to Architecture (I) on the UC Merced and WHU-RS datasets. Nevertheless, the remote scene classification accuracy of Architecture (II) is higher than that of Off-the-shelf in every case. These experimental results confirm what we discussed in Section 2.3.2. LQPCANet in Architecture (II) rearranges the spectral information of the remote sensing images in the Brazilian Coffee Scenes dataset and reduces the "distance" between source dataset and target dataset in the transferring process. As a result, Architecture (II) makes better use of the high-level features in pre-trained deep CNNs and enhances their generalization power when the spectral information differs between source and target datasets.
Taking a close look at Figure 10, we observe that the remote sensing images in the Brazilian Coffee Scenes dataset are composed of simple edges. That is, the spatial information of these images is very simple, and we should pay more attention to discriminating inter-class variability than to invariance to intra-class variability. By contrast, as shown in Figures 8 and 9, invariance to intra-class variability is more important for remote scene classification on the UC Merced and WHU-RS datasets. Therefore, we further test the effectiveness of the pooling operation in LQPCANet in Architecture (II). The remote scene classification accuracies obtained with different pooling ranges in Architecture (II) on the different datasets are reported in Figure 16. These pooling ranges are set according to the image size of each remote sensing dataset to guarantee a non-overlapping pooling operation. In addition, the difference in classification accuracy between pooling ranges is small. Thus, for each pooling range, we repeat the experiment 10 times and show the average result in Figure 16.
Figure 16 shows that the pooling range in LQPCANet may affect the performance of Architecture (II). When the remote sensing images are composed of sophisticated objects, a relatively larger pooling range in Architecture (II) enhances invariance to intra-class variability and brings better performance in remote scene classification. By contrast, a relatively smaller pooling range contributes more when the remote sensing images consist of simple edges or blobs, such as the images in the Brazilian Coffee Scenes dataset. Moreover, Figure 16 suggests that a relatively larger pooling range in the LPCANet may be preferable when we apply Architecture (I) to the UC Merced and WHU-RS datasets in Section 3.2.
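The pooling-range protocol described above can be sketched as follows. This is a minimal sketch under stated assumptions: "non-overlapping" is read here as "the pooling window tiles the feature map exactly", the feature map is assumed to have the same side length as the input image, and `run_experiment` is a hypothetical callback standing in for one full classification run. The function names and the `max_size` cap are illustrative and do not come from the paper.

```python
def non_overlapping_pool_sizes(image_size, max_size=None):
    """Return the window sizes (>= 2) that tile a side of length image_size exactly."""
    limit = image_size if max_size is None else max_size
    return [p for p in range(2, limit + 1) if image_size % p == 0]


def mean_accuracy(run_experiment, pool_size, n_runs=10):
    """Average the accuracy of n_runs repetitions for one pooling range.

    run_experiment(pool_size, seed) is a hypothetical callback that returns
    the classification accuracy of a single run.
    """
    scores = [run_experiment(pool_size, seed) for seed in range(n_runs)]
    return sum(scores) / len(scores)
```

For example, a side length of 64 pixels admits the windows 2, 4, 8 and 16 when capped at 16, whereas a side length of 600 pixels admits 2, 3, 4, 5, 6, 8 and 10 when capped at 10, which is why the candidate ranges differ from one dataset to another.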
Furthermore, we visualize the global representations of the remote sensing images in the Brazilian Coffee Scenes dataset. These global representations are encoded via pre-trained CaffeNet in Architecture (II), Architecture (I) and Off-the-shelf, respectively. The high-dimensional image features are embedded in a 2-D space using the t-SNE algorithm [40,41]. In all cases, the perplexity and the number of training iterations of the t-SNE algorithm are set to 30 and 1000, respectively. As shown in Figure 17, with the same pre-trained deep CNN, Architecture (II) leads to the best separability of global representations when the spectral information differs between source and target datasets. As a result, Architecture (II) enhances the generalization power of the pre-trained deep CNN and brings better performance for remote scene classification on the Brazilian Coffee Scenes dataset.
On the Brazilian Coffee Scenes dataset, we further compare the performance of Architecture (II) with several well-known methods. As shown in Table 6, the comparison is limited, because this dataset was only released in 2015 [25] and little research has been conducted on it so far. We find that our proposed Architecture (II) performs well without training or fine-tuning the parameters of the pre-trained CNNs. In [22], training deep CNNs from scratch on the Brazilian Coffee Scenes dataset achieves a classification accuracy of up to 91.83%. However, training a deep CNN from scratch is very time-consuming and depends heavily on the scale of the target dataset. Comparing the method of directly transferring pre-trained GoogLeNet for remote scene classification (84.02%) with Architecture (II) containing the same pre-trained GoogLeNet (88.46%), we obtain evidence that Architecture (II) indeed enhances the generalization power of pre-trained CNNs for remote scene classification when the spectral information differs between source and target datasets.
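As a quick check on the comparison above, the differences can be worked out from the three accuracies quoted in the text alone. The symbols below are introduced only for this illustration, and the differences are absolute percentage points rather than relative improvements.

```latex
\Delta_{\text{Ar(II)} - \text{OTS}} = 88.46\% - 84.02\% = 4.44 \text{ percentage points}
\qquad
\Delta_{\text{scratch} - \text{Ar(II)}} = 91.83\% - 88.46\% = 3.37 \text{ percentage points}
```

In other words, with the same pre-trained GoogLeNet, Architecture (II) recovers more than half of the 7.81-point gap between direct transfer and training from scratch, without updating any CNN parameters.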

Discussion
From the extensive experiments above, our two proposed architectures, which contain LPCANet and LQPCANet, respectively, have been shown to be effective for remote scene classification. As discussed in [22,25,26], deep CNNs pre-trained on everyday objects can be successfully transferred to the remote sensing domain. To some degree, this transferring strategy achieves state-of-the-art performance for remote scene classification. The major factor that affects this transferring process has been shown to be the generalization power of pre-trained deep CNNs [59][60][61]. However, as our experiments show, the difference in spatial and spectral information between source and target datasets creates a bottleneck for the generalization power of pre-trained deep CNNs. Based on transferring pre-trained deep CNNs, the works in [22,25] further improve the performance of remote scene classification by fine-tuning and feature fusing, respectively. Nevertheless, they make no effort to adapt the remote sensing images themselves for the transferring process. In our proposed Architecture (I), the LPCANet is used to filter out noise and enhance the edges in remote sensing images. In addition, LQPCANet in Architecture (II) rearranges the relative relationship of the spectral channels of remote sensing images. The two proposed architectures enhance the generalization power of pre-trained deep CNNs for remote scene classification and break the bottleneck mentioned above. Moreover, our method can be seen as a starting point and can be further improved by fine-tuning or feature fusing. Specifically, several practical observations from the experiments and some limitations of our study are summarized as follows:

• In Tables 1 and 5, we can see that the performances of pre-trained AlexNet, CaffeNet, VGG-VD16 and GoogLeNet are almost the same for remote scene classification in the Off-the-shelf condition. There is an obvious bottleneck in directly transferring pre-trained deep CNNs to the task of remote scene classification. Our two proposed architectures improve the performance of pre-trained CNNs in an unsupervised manner and provide a better starting point for further methods (such as fine-tuning and feature fusing) to achieve better performance in remote scene classification.

• To our surprise, the most successful deep CNNs to date, ResNets, fail to obtain good experimental results when we transfer them for remote scene classification, regardless of whether they have 50, 101 or 152 layers. This phenomenon indicates that not all successful deep CNNs are suitable for transferring to the task of remote scene classification.

• The choice between our two proposed architectures depends on the target dataset in the transferring process, namely the remote sensing dataset to which we transfer pre-trained deep CNNs for remote scene classification. When the spectral information of the source and target datasets is the same, we use Architecture (I); when their spectral information is different, we prefer Architecture (II).

• Compared with directly transferring pre-trained deep CNNs for remote scene classification, our method provides a new way to optimize the transferring process. When any successful deep CNN developed in the future is transferred for remote scene classification, our proposed method can take it a step further.

• The transferring strategy in our paper is limited by the number of spectral channels of the input images accepted by deep CNNs pre-trained on everyday optical images. For remote sensing images with more than three spectral channels, the spectral dimension must be reduced to three to fit the pre-trained deep CNNs transferred to them. Undoubtedly, this operation causes a loss of spectral information.

• In the remote sensing field, the scale of remote sensing datasets will keep growing. At the same time, the structure of deep CNNs will be further optimized, and the number of parameters in them will decrease. Therefore, our proposed framework could extract more useful information from remote sensing datasets, obtain better generalization power from pre-trained deep CNNs, and suffer less from overfitting.
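The selection rule and the three-channel limitation listed above can be summarized in a short sketch. The names `DatasetInfo`, `SOURCE_BANDS` and `choose_architecture` are hypothetical, and comparing band names is only a simple stand-in for "same spectral information"; the paper does not prescribe an automatic test.

```python
from dataclasses import dataclass

# Assumed band layout of the everyday optical images used for pre-training.
SOURCE_BANDS = ("red", "green", "blue")


@dataclass(frozen=True)
class DatasetInfo:
    name: str
    bands: tuple  # spectral channels of the target images, e.g. ("red", "green", "blue")


def choose_architecture(target, source_bands=SOURCE_BANDS):
    """Pick Architecture (I) or (II) from the spectral bands of the target dataset."""
    if len(target.bands) != len(source_bands):
        # The pre-trained CNNs accept three channels only; reducing more
        # channels to three is outside this sketch and loses spectral information.
        raise ValueError(f"{target.name}: expected {len(source_bands)} spectral channels")
    same_spectrum = sorted(target.bands) == sorted(source_bands)
    return "Architecture (I)" if same_spectrum else "Architecture (II)"
```

For instance, `choose_architecture(DatasetInfo("Brazilian Coffee Scenes", ("green", "red", "infrared")))` returns `"Architecture (II)"`, whereas an RGB dataset such as UC Merced maps to `"Architecture (I)"`.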
Based on our study, future research directions for transferring pre-trained deep CNNs for remote scene classification may be as follows. Firstly, the parameters of LPCANet and LQPCANet are chosen empirically in this paper, and how to tune them to obtain better performance remains to be studied. Secondly, instead of placing LPCANet or LQPCANet before pre-trained deep CNNs, would replacing some convolutional layers in pre-trained deep CNNs with LPCANet or LQPCANet work? Finally, as discussed above, the most successful ResNet does not work as we expected when transferred for remote scene classification. Thus, we should find deep CNN structures that are more suitable for transfer to the remote sensing field.

Conclusions
In this paper, we have presented a framework to enhance the generalization power of pre-trained deep CNNs for remote scene classification. To handle the difference in spatial and spectral information between remote sensing images and the images in the pre-training dataset, two promising architectures are proposed to reduce the "distance" between them.
The two main conclusions of this work are as follows: (1) When the remote sensing dataset and the pre-training dataset differ in spatial information, Architecture (I) enhances the generalization power of the pre-trained deep CNN it contains and achieves better performance in remote scene classification. The linear PCA network in Architecture (I) synthesizes the spatial information of remote sensing images in each spectral channel, and reduces the spatial "distance" between source and target datasets; (2) When the remote sensing dataset and the source dataset differ in spectral information, remote sensing images are represented as pure quaternions in the linear quaternion PCA network, which further synthesizes their spectral information. As a result, Architecture (II) enhances the generalization power of the pre-trained deep CNN it contains, and improves the classification accuracy of remote scenes. Experiments on three datasets with different properties have provided insightful information. Architecture (I) outperforms the Off-the-shelf method with a gain of up to 1.37% on the UC Merced dataset and 1.46% on the WHU-RS dataset. Architecture (II) outperforms the Off-the-shelf method with a gain of up to 4.4% on the Brazilian Coffee Scenes dataset. Moreover, the effect of our proposed architectures becomes more evident as the "distance" between source and target datasets becomes larger.
We believe our proposed method in this work can serve as a good baseline for people to transfer pre-trained deep CNNs to other remote sensing datasets with more advanced processing components or more sophisticated structures.