Ensemble Learning Approaches Based on Covariance Pooling of CNN Features for High Resolution Remote Sensing Scene Classification

Abstract: Remote sensing image scene classification, which consists of labeling remote sensing images with a set of categories based on their content, has received remarkable attention for many applications such as land use mapping. Standard approaches are based on the multi-layer representation of first-order convolutional neural network (CNN) features. However, second-order CNNs have recently been shown to outperform traditional first-order CNNs for many computer vision tasks. Hence, the aim of this paper is to demonstrate the use of second-order statistics of CNN features for remote sensing scene classification. These statistics take the form of covariance matrices computed locally or globally on the output of a CNN. However, these data points do not lie in a Euclidean space but on a Riemannian manifold, so Euclidean tools are not suited to manipulating them. Other metrics should be considered, such as the log-Euclidean one, which consists of projecting the set of covariance matrices onto a tangent space defined at a reference point. In this tangent space, which is a vector space, conventional machine learning algorithms can be used, such as Fisher vector encoding or an SVM classifier. Based on this log-Euclidean framework, we propose a novel transfer learning approach composed of two hybrid architectures based on covariance pooling of CNN features: the first is local and the second is global. Both rely on features extracted from models pre-trained on the ImageNet dataset and processed with machine learning algorithms. The first hybrid architecture is an ensemble learning approach with the log-Euclidean Fisher vector encoding of region covariance matrices computed locally on the first layers of a CNN. The second is an ensemble learning approach based on the covariance pooling of CNN features extracted globally from the deepest layers. These two ensemble learning approaches are then combined using the strategy of the most diverse ensembles.
For validation and comparison purposes, the proposed approach is tested on various challenging remote sensing datasets. Experimental results show a gain of approximately 2% in overall accuracy on the UC Merced dataset for the proposed approach compared to a similar state-of-the-art method based on covariance pooling of CNN features.


Introduction
A supervised classification algorithm aims to label an image with the corresponding class according to its content. Conventional approaches are based on encoding handcrafted features with, for example, the bag of words model (BoW) [1], the vector of locally models, second-order representation is introduced only for the deepest layers. To overcome this issue, Gao et al. [39] have proposed global second-order pooling (GSoP) convolutional networks, which make it possible to introduce higher-order representations in earlier layers. Nevertheless, training such a deep CNN model from scratch requires a huge labeled training set. Recently, the remote sensing community has started to build large-scale datasets that can serve for pre-training, such as BigEarthNet, composed of Sentinel-2 image patches [41]. However, for many practical applications, most remote sensing datasets are quite small.
Several ideas have been proposed to overcome this issue, such as using a new kind of neural network called the capsule network [42], which is able to work with a small amount of training data. Compared to convolutional neural networks, capsule networks address the "Picasso problem" in image recognition, i.e., images that show the right components but not the right spatial relationships, for example a face image in which the locations of the eye and ear are swapped. For our application of remote sensing scene classification, this is not critical. For instance, in a harbour scene, the location of the scene elements (boats, pontoon, etc.) in the image is not so important; the key point is that the network is able to recognize them. Another effective solution for limited training sets is transfer learning. In that case, CNN models are considered as feature extractors. Classically, deep CNN models pre-trained on the ImageNet dataset are used. Then, features are extracted from a single layer or multiple layers and processed with machine learning algorithms. This technique has proved to be efficient and outperforms traditional handcrafted feature-based methods [13]. In a recent paper, Pires de Lima et al. have shown that transfer learning strategies based on feature extraction are among the best approaches for remote sensing scene classification, especially for datasets with a low number of training samples [43]. In this context, in order to benefit from both pre-trained deep neural networks and second-order representations, this work proposes a novel ensemble learning approach based on covariance pooling of CNN features for remote sensing scene classification. It consists of a combination of two hybrid architectures exploiting second-order features.
The former is based on the log-Euclidean Fisher vector encoding of region covariance matrices computed locally on the first layers of a CNN [28] and its extension to the use of an ensemble learning strategy to combine multiple classifiers. The latter concerns an ensemble learning approach based on the covariance pooling of CNN features extracted from deeper layers [44].
In summary, second-order representation (i.e., covariance pooling) has been shown to be useful for many signal and image processing tasks. Recently, in the remote sensing community, some works have shown interest in these second-order features for various remote sensing applications (e.g., remote sensing scene classification, texture recognition) [35,40,45,46]. Motivated by these works and the success of deep neural networks, we have recently proposed two hybrid transfer learning approaches based on covariance pooling of CNN features [28,44]. These two methods use either a local or a global second-order representation of CNN features. The main motivation of this journal paper is to unify these works by presenting a transfer learning approach that benefits from both. The main contributions of the paper can be summarized as follows:

• We propose a transfer learning approach that efficiently combines local and global second-order representations of CNN features. For the local one, an ensemble learning extension of our log-Euclidean Fisher vector encoding of region covariance matrices [28] is introduced. For the global one, our covariance pooling of the deepest CNN features is considered [44].

• An ensemble learning approach based on the most diverse ensembles is proposed to combine these decisions and enhance the classification performance.

• This transfer learning approach is validated on different labeled remote sensing datasets to illustrate its efficiency. Three are publicly available, namely the UC Merced Land Use, SIRI-WHU and AID datasets. The two others are internal datasets, the oyster racks and maritime pine forest datasets, which were manually labeled by thematic experts.

The paper is structured as follows. Since the second-order representation of CNN features is at the core of the paper, Section 2 gives the mathematical background for the log-Euclidean representation of a covariance matrix. Next, Section 3 introduces the proposed ensemble learning approach based on the log-Euclidean Fisher vector encoding of region covariance matrices. Then, Section 4 recalls our ensemble learning approach based on covariance pooling (ELCP) of CNN features. In order to combine these two methods, Section 5 presents the fusion scheme based on the most diverse ensembles. Section 6 then summarizes a series of experiments performed on remote sensing scene classification. Finally, Section 7 provides the main conclusions and perspectives of this work.

Log-Euclidean Framework for Second-Order Statistics of CNN Features
In the literature, second-order statistics have been shown to play an important role in the human visual recognition process [21]. In practice, the covariance matrix of handcrafted descriptors, textural or deep convolutional features is computed and integrated into the classification algorithm. Since covariance matrices are symmetric positive definite (SPD) matrices, they have a specific geometry, and standard Euclidean tools are not suited to them. The present section explains the geometry of SPD matrices and the classical metrics used to manipulate these data. In fact, these data points lie inside the cone of positive definite matrices, which is a Riemannian manifold.
Therefore, applying standard Euclidean operations to covariance matrices, for instance computing the Euclidean distance between two covariance matrices, may lead to undesirable results such as the swelling effect observed in [47]. Many authors have raised the need for intrinsic tools to analyze SPD matrices [32,48]. As pointed out by Pennec et al., the log-Euclidean and affine invariant Riemannian metrics enjoy desirable invariance properties compared to the Euclidean metric. The affine invariant Riemannian distance is invariant under affine transformations.
Even if the log-Euclidean metric does not yield full affine invariance, it is invariant by similarity (orthogonal transformation and scaling): computations using this metric are invariant with respect to a change of coordinates obtained by a similarity. From a practical point of view, Arsigny et al. have shown in [32] that the affine invariant and log-Euclidean frameworks perform better than the Euclidean one for the interpolation and regularization of their synthetic and clinical 3D diffusion tensor magnetic resonance imaging (DT-MRI) data. These frameworks have the advantage of capturing the underlying scatter of the data points (which are covariance matrices) more accurately than methods that treat data points as elements of a vector space. For many applications, the log-Euclidean framework has shown competitive results compared to the affine invariant Riemannian one [31,32]. The log-Euclidean framework is considered in this paper for its efficiency and ease of use. The basic principle is the following. Each covariance matrix $\mathbf{M}_n$ is mapped onto the tangent space, which locally flattens the manifold, as illustrated in Figure 1. This consists of projecting the covariance matrices onto a common tangent space of the manifold at the reference point $\mathbf{M}_{ref}$ via the log map operator [26,32,45] of Equation (1), where $\mathbf{M}_n^{T_{\mathbf{M}_{ref}}}$ denotes the covariance matrix $\mathbf{M}_n$ projected on the tangent space at the reference point $\mathbf{M}_{ref}$. Then, to obtain a vector representation, a vectorization operation $\mathrm{Vec}(\cdot)$ is applied to the projected matrix $\mathbf{X}$, with $X_{ij}$ the element of $\mathbf{X}$ at row $i$ and column $j$. These two operations yield the log-Euclidean vector representation of $\mathbf{M}_n$ computed at the reference point $\mathbf{M}_{ref}$, denoted $\mathbf{m}_n^{T_{\mathbf{M}_{ref}}}$. Once the covariance matrices are projected on the tangent space at $\mathbf{M}_{ref}$, they lie in a vector space where conventional image processing and machine learning methods can be used.
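For reference, a common way to write these three operations is sketched below. The affine-invariant form of the log map at $\mathbf{M}_{ref}$, the $\sqrt{2}$ weighting of the off-diagonal entries, and the direct composition of the two operations are assumptions based on usual practice in the log-Euclidean literature; equation numbers are deliberately omitted.

```latex
% Log map at the reference point (assumed standard form)
\mathbf{M}_n^{T_{\mathbf{M}_{ref}}} = \log_{\mathbf{M}_{ref}}(\mathbf{M}_n)
  = \mathbf{M}_{ref}^{1/2}\,
    \log\!\left(\mathbf{M}_{ref}^{-1/2}\,\mathbf{M}_n\,\mathbf{M}_{ref}^{-1/2}\right)
    \mathbf{M}_{ref}^{1/2}

% Vectorization of a symmetric d x d matrix (assumed sqrt(2) convention)
\mathrm{Vec}(\mathbf{X}) = \left[ X_{11},\ \sqrt{2}\,X_{12},\ \ldots,\ \sqrt{2}\,X_{1d},\
  X_{22},\ \sqrt{2}\,X_{23},\ \ldots,\ X_{dd} \right]

% Log-Euclidean vector representation (assumed composition of the two)
\mathbf{m}_n^{T_{\mathbf{M}_{ref}}} = \mathrm{Vec}\!\left(\mathbf{M}_n^{T_{\mathbf{M}_{ref}}}\right)
```

With the $\sqrt{2}$ weighting, the Euclidean norm of $\mathrm{Vec}(\mathbf{X})$ equals the Frobenius norm of $\mathbf{X}$, which is why this convention is often preferred.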
Within this framework, the tangent space is computed at a reference point $\mathbf{M}_{ref}$, as shown in (1). Different choices can be made for this reference point, such as the identity matrix, the center of mass, or the median. Using the identity matrix $\mathbf{I}_d$ is undoubtedly the simplest and most common way to map covariance matrices onto the tangent space, and this choice is made in the following. In that case, the log map operator in Equation (1) reduces to $\log_{\mathbf{I}_d}(\mathbf{M}_n) = \log(\mathbf{M}_n)$, i.e., the ordinary matrix logarithm. Let $\mathbf{A} = \mathbf{V}\mathbf{D}\mathbf{V}^T$ be the eigenvalue decomposition of an SPD matrix; its logarithm is defined as $\log(\mathbf{A}) = \mathbf{V}\log(\mathbf{D})\mathbf{V}^T$. Since $\mathbf{D}$ is the diagonal matrix of eigenvalues, $\log(\mathbf{D})$ is also a diagonal matrix, whose diagonal elements are the logarithms of the eigenvalues. In the next two sections, this log-Euclidean framework is employed in two hybrid architectures where the covariance matrix is computed on CNN features.
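The mapping at the identity can be illustrated with a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, and the $\sqrt{2}$ scaling of off-diagonal entries in the vectorization is an assumed convention.

```python
import numpy as np

def spd_log(A):
    """Matrix logarithm of an SPD matrix via eigendecomposition: V log(D) V^T."""
    eigvals, eigvecs = np.linalg.eigh(A)
    # Scaling column j of V by log(lambda_j) is equivalent to V @ diag(log(D)).
    return (eigvecs * np.log(eigvals)) @ eigvecs.T

def vec_sym(X):
    """Stack the upper triangle of a symmetric matrix row by row,
    scaling off-diagonal entries by sqrt(2) (assumed convention)."""
    rows, cols = np.triu_indices(X.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * X[rows, cols]

def log_euclidean_vector(M):
    """Log-Euclidean vector of an SPD matrix at the identity reference point."""
    return vec_sym(spd_log(M))
```

For a $d \times d$ covariance matrix the resulting vector has $d(d+1)/2$ entries, so any vector-space method can be applied to it directly.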

Local Covariance Pooling: Ensemble Log-Euclidean Fisher Vector Architecture
A scene image is composed of a set of visual elements. For example, a harbour scene is formed by many objects such as boats, water, and pontoons. In this context, coding-based methods such as FV or VLAD descriptors reached the state of the art at the beginning of the 2000s [2][3][4]. These methods rely on the creation of a codebook whose codewords represent meaningful object parts of the scene. More recently, deep learning models (and CNNs in particular) have been shown to outperform these coding methods by a significant margin. For instance, deep learning-based methods have won the ImageNet large scale visual recognition challenge since 2012 [13]. In order to benefit from both strategies, many authors in the recent scene classification literature have introduced hybrid architectures that combine CNNs with coding methods. For example, Perronnin et al. [14] have proposed a network of fully connected layers trained on FV descriptors. Simonyan et al. introduced in [15] the Fisher network, which is composed of several stacked FV layers. Later, Arandjelovic et al. [16] proposed the NetVLAD layer, which mimics the VLAD layer. Building on the success of these hybrid architectures, particular attention is given here to the approach introduced in [20]. In that paper, Li et al. proposed a hybrid structure that encodes each output of the convolutional layers of a pre-trained neural network with FV. This technique has demonstrated competitive results for remote sensing scene classification. To capture phenomena at various scales when applying the FV encoding, a Gaussian pyramid is considered: multiscale images are generated by Gaussian smoothing and sub-sampling at different scales, as detailed in [20]. Classification results have demonstrated the benefit of using multiscale images compared to a single input image. Therefore, a pyramid of three scale levels is retained in the following.
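A three-level Gaussian pyramid of this kind can be sketched as follows. The 5-tap binomial kernel, the reflective border handling, and the sub-sampling factor of 2 are assumptions made for illustration; the exact smoothing and scale factors used in [20] may differ.

```python
import numpy as np

def gaussian_pyramid(image, levels=3):
    """Return `levels` images: the input, then successively blurred and
    sub-sampled versions. `image` is an (H, W) or (H, W, C) array."""
    # Binomial kernel, a common discrete approximation of a Gaussian (assumed).
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

    def smooth(img, axis):
        # Separable 1-D filtering along one axis with reflective padding.
        return np.apply_along_axis(
            lambda v: np.convolve(np.pad(v, 2, mode="reflect"), kernel, mode="valid"),
            axis, img)

    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        blurred = smooth(smooth(pyramid[-1], 0), 1)
        pyramid.append(blurred[::2, ::2])  # keep every second row and column
    return pyramid
```

Each level is then passed through the same pre-trained CNN, and the convolutional features obtained at the different scales are pooled together before encoding.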
These multiscale images are fed into the CNN model, allowing the extraction of convolutional features, which are then concatenated before being encoded with FV. Note that CNN models are used only to extract deep features, without any training from scratch or fine-tuning. Once the multiscale features are extracted from each convolutional layer, an individual codebook is generated. In this approach, the dimension K of the codebook is the same for all the layers. The CNN features are then encoded with the improved FV [5]. Next, these FVs are fused to represent the mid-level feature vectors of a scene image. However, this approach does not consider second-order features, which have proved to be efficient in many classification problems and have been shown to outperform first-order features in many image processing applications, including material recognition and person re-identification. To address this, we have proposed in [28] a novel hybrid architecture named Hybrid LE FV, which integrates second-order features into the classification algorithm, as illustrated in Figure 2. It consists of the log-Euclidean Fisher vector (LE FV) encoding of the covariance matrices of CNN features computed locally on the layer outputs. Section 3.1 presents in detail the principle of this Hybrid LE FV approach, from the extraction of region covariance matrices to the FV encoding with the learned codebook [28]. Then, to improve the classification performance, an ensemble learning version of the Hybrid LE FV strategy is detailed in Section 3.2.

Region Covariance Matrices
The first step is to extract region covariance matrices computed on a sliding window over the feature maps of a CNN. Hence, each image is represented by a set $\mathcal{M} = \{\mathbf{M}_n\}_{n=1:N}$ of covariance matrices $\mathbf{M}_n \in \mathcal{P}_d$. As the size of the output of a CNN layer depends on the layer depth, only the first and second layers of a CNN are considered for computing local covariance matrices. Indeed, for the deepest layers, the feature maps have a small spatial dimension, which does not allow the extraction of a large set of covariance matrices. For this reason, particular attention is given to the choice of the CNN model. Here, the CNN model is a very deep convolutional network named vgg-vd-16 [49]. It is composed of 16 weight layers and is characterized by a simple stack of 3 × 3 convolutional layers with a stride of 1 pixel and a spatial padding of 1 pixel. Therefore, the size of the output feature map is preserved through the first two layers, which permits the extraction of a sufficient set of region covariance matrices. Then, according to the log-Euclidean framework detailed in Section 2, these region covariance matrices are encoded with the LE FV. For that, a codebook is first learned by considering a Gaussian mixture model (GMM) on the manifold of SPD matrices.
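The sliding-window extraction can be sketched as below. The window size, the stride, and the small ridge term added to keep each matrix strictly positive definite are illustrative assumptions; the function name is hypothetical, and the d feature maps are assumed to have been selected beforehand (with d at least 2).

```python
import numpy as np

def region_covariances(feature_maps, window=8, stride=8, eps=1e-6):
    """Compute one d x d covariance matrix per sliding window.

    feature_maps: array of shape (H, W, d), i.e. d feature maps of size H x W.
    Returns an array of shape (N, d, d), with N the number of windows.
    """
    H, W, d = feature_maps.shape
    covs = []
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            patch = feature_maps[top:top + window, left:left + window, :]
            samples = patch.reshape(-1, d)        # window*window observations
            C = np.cov(samples, rowvar=False)     # d x d sample covariance
            covs.append(C + eps * np.eye(d))      # ridge keeps C strictly SPD
    return np.stack(covs)
```

Because the first layers of vgg-vd-16 preserve the spatial size of the input, even a coarse stride yields many windows per image, which is what makes a local codebook feasible on these layers.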

Gaussian Mixture Model and Codebook Creation
Let us consider a GMM defined as a weighted sum of $K$ components $p(\mathbf{M}|\mathbf{M}_k, \Sigma_k)$, where each component is a multivariate Gaussian distribution defined on the tangent space at the identity matrix, with probability density function given by Equation (7). Here, $\omega_k$, $\mathbf{M}_k$ and $\Sigma_k$ are respectively the weight, mean and covariance matrix of the kth component of the GMM. In addition, the classical assumption of a diagonal covariance matrix is made, where $\sigma_k^2$ is the variance vector [4]. Moreover, Equation (7) can be rewritten in terms of log-Euclidean vectors, where $\mu_k$ is the log-Euclidean mean vector of the kth component of the GMM, and $\mathbf{m}^{T_{\mathbf{I}_d}}$ is the LE vector representation of $\mathbf{M}$ given by Equations (4) and (5). Since covariance matrices are projected into the tangent space and represented by their corresponding LE vectors, all the algorithms developed on a vector space can be used. In particular, the EM algorithm for parameter estimation of a GMM is used to estimate the weights, means, and dispersion parameters. The set of these estimated parameters represents the codebook that is then used to encode the set of region covariance matrices extracted from each image.
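Under the diagonal-covariance assumption, the mixture and its components can be written in terms of log-Euclidean vectors as sketched below. This is the standard diagonal GMM density applied to vectors of dimension $D = d(d+1)/2$; the symbol $\theta$ for the full parameter set and the symbol $D$ for the vector dimension are notational assumptions introduced here.

```latex
p(\mathbf{M} \mid \theta) = \sum_{k=1}^{K} \omega_k \, p(\mathbf{M} \mid \mathbf{M}_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \omega_k = 1,

p(\mathbf{M} \mid \mathbf{M}_k, \Sigma_k)
  = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma_k(j)}
    \exp\!\left( -\frac{\left( m^{T_{\mathbf{I}_d}}(j) - \mu_k(j) \right)^2}{2\,\sigma_k(j)^2} \right),
\qquad \Sigma_k = \mathrm{diag}\!\left(\sigma_k^2\right).
```

Here $m^{T_{\mathbf{I}_d}}(j)$, $\mu_k(j)$ and $\sigma_k(j)$ denote the jth elements of the corresponding vectors, so the density factorizes over the $D$ coordinates of the tangent space.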

Log-Euclidean Fisher Vector Encoding
Let $X = \{\mathbf{m}_n^{T_{\mathbf{I}_d}}\}_{n=1:N}$ be the set of $N$ log-Euclidean vectors extracted locally from the first convolutional layers for an image. The LE FV encoding consists of projecting these local features onto the codebook defined in the previous subsection. The LE FV descriptor assigned to $X$ is obtained by computing the gradient of the log-likelihood with respect to the GMM parameters, scaled by the inverse square root of the Fisher information matrix (FIM) $F_\lambda$ [4]. Here, $\lambda$ represents each of the distribution parameters ($\omega_k$, $\mu_k$ and $\sigma_k$). In practice, the derivatives with respect to the mean $\mu_k(j)$ and the standard deviation $\sigma_k(j)$ have been found to be the most useful [4]. Hence, two FVs are obtained by differentiating with respect to these two elements, where $\mu_k(j)$ (resp. $\sigma_k(j)$) is the jth element of the vector $\mu_k$ (resp. $\sigma_k$) and $\gamma_k(\mathbf{m}_n^{T_{\mathbf{I}_d}})$ is the soft assignment of $\mathbf{m}_n^{T_{\mathbf{I}_d}}$ to the kth Gaussian component. Once the FV descriptors are obtained, a post-processing step is conventionally used to enhance the classification accuracy [5,8]. This consists of a power normalization followed by an $\ell_2$ normalization. Furthermore, to avoid the curse of dimensionality when the FV descriptor is high-dimensional, a dimension reduction step can be used. In the following, kernel discriminant analysis (KDA) is considered [50]. Finally, a linear SVM classifies each test image from its FV representation.
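The encoding and its post-processing can be sketched with the widely used improved Fisher vector formulas for a diagonal GMM. The normalization constants below follow that standard formulation and are assumed to match the expressions used here; the function name, argument names and the power exponent `alpha=0.5` are illustrative choices.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas, alpha=0.5):
    """Fisher vector of N local vectors under a diagonal GMM.

    X: (N, D) log-Euclidean vectors; weights: (K,);
    means, sigmas: (K, D), with sigmas holding standard deviations.
    Returns a vector of length 2*K*D (mean part, then std part).
    """
    N, D = X.shape
    diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]   # (N, K, D)

    # Soft assignments gamma_k(x_n), computed in the log domain for stability.
    log_prob = (np.log(weights)[None, :]
                - 0.5 * np.sum(diff ** 2, axis=2)
                - np.sum(np.log(sigmas), axis=1)[None, :]
                - 0.5 * D * np.log(2.0 * np.pi))
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)                          # (N, K)

    # Gradients with respect to the means and the standard deviations.
    g_mu = np.einsum("nk,nkd->kd", gamma, diff) / (N * np.sqrt(weights)[:, None])
    g_sigma = (np.einsum("nk,nkd->kd", gamma, diff ** 2 - 1.0)
               / (N * np.sqrt(2.0 * weights)[:, None]))

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.abs(fv) ** alpha         # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv           # l2 normalization
```

The descriptor length grows as $2KD$ with $D = d(d+1)/2$, which is why a dimension reduction step such as KDA is applied before the linear SVM.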

Sensitivity Analysis
As explained in the previous subsection, two parameters have to be tuned for the proposed Hybrid LE FV method, namely the number of components K in the GMM and the dimension d of the covariance matrices. To evaluate the influence of each parameter on classification accuracy, experiments are carried out on the UC Merced Land Use Land Cover dataset [51]. This dataset is composed of 21 classes, each containing 100 remote sensing images of 256 × 256 pixels. Figure 3 shows some examples of the UC Merced image classes. In order to assess the proposed approaches in challenging conditions, only a small set of p = 10% of the images is retained for training in all experiments, and the remaining images are used for testing. Classification results are evaluated in terms of overall accuracy averaged over five runs.

Figure 3. Examples of image classes from the UC Merced dataset (Airplane, Forest, Tennis court, Parking).

Here, the number of GMM components is fixed to 30. The dimension d is the number of selected principal components. If d is too small, too few principal components are retained: the variability is not well explained, which leads to low classification accuracy. When d increases, more variability is explained, and the classification performance also increases. Beyond a certain value (d = 5 in our experiments), however, the gain in explained variance is small and the classification performance remains quite stable. Hence, it is recommended to consider a covariance matrix size greater than d = 5.
To evaluate the sensitivity of the proposed approach to the number of GMM components, Table 1 shows the classification accuracy for three values of K in the GMM. As observed, the approach is not sensitive to the codebook dimension.

Table 1. Classification accuracy of Hybrid LE FV using three codebook dimensions K.

Ensemble Hybrid Log-Euclidean Fisher Vector (Ens. Hybrid LE FV)
In machine learning, ensemble learning strategies have become increasingly popular [52,53]. They rely on the combination of multiple weak classifiers to form a stronger one, thereby improving classification performance. Inspired by this idea, we introduce an ensemble learning approach for the hybrid log-Euclidean Fisher vector presented in the previous subsection. The workflow of this method, named "Ens. Hybrid LE FV", is shown in Figure 5. For each convolutional layer (conv 1 and/or conv 2), N subsets are considered. For each subset, d feature maps are randomly selected with replacement. Then, the hybrid log-Euclidean Fisher vector presented before is applied to obtain a decision for this subset. In the end, a majority vote over these decisions gives the final prediction. A first experiment is conducted to evaluate the sensitivity of the proposed approach to the number of subsets N. Table 2 shows the classification accuracy of the "Ens. Hybrid LE FV" strategy for the first convolutional layer of the vgg-vd-16 model. Five values of N are tested (5, 7, 9, 11, and 13) with p = 10% of the UC Merced images used for training.

Table 2. Classification accuracy of "Ens. Hybrid LE FV" using different numbers of subsets N.

| Method | N = 5 | N = 7 | N = 9 | N = 11 | N = 13 |
|---|---|---|---|---|---|
| Ens. Hybrid LE FV | 63.7 ± 0.6% | 64.0 ± 0.3% | 64.0 ± 0.3% | 63.9 ± 0.1% | 64.0 ± 0.5% |

One can observe that the results remain quite stable with respect to the number of subsets N. For further experiments, N is fixed to 7. Table 3 highlights the classification results obtained on the UC Merced dataset for the first (conv 1) and second (conv 2) convolutional layers of the vgg-vd-16 network. The proposed ensemble learning approach, "Ens. Hybrid LE FV", is compared to two closely related state-of-the-art strategies. The first one, named "Hybrid FV", consists of encoding the output of the convolutional layers with FV [20]. Note that this approach considers only first-order statistics. The second one, named "Hybrid LE FV", is the one presented in Section 3.1. It exploits second-order statistics but not in an ensemble learning approach [28]. As observed in Table 3, the benefit of exploiting second-order statistics is clearly demonstrated for the first and second CNN convolutional layers. A significant gain of 20% to 25% is reported for the proposed "Hybrid LE FV" and "Ens. Hybrid LE FV" methods compared to the conventional "Hybrid FV" approach. In addition, for these first two layers, a significant gain is observed when exploiting an ensemble learning strategy compared to the use of a single classifier. In this approach, only covariance matrices computed on the first layers of a CNN have been encoded with the LE FV. Indeed, as the deepest convolutional layers of the vgg-vd-16 network have relatively small spatial dimensions, a sufficient number of region covariance matrices cannot be computed on them. Nevertheless, the deepest layers may provide useful features for the classification. To address this issue, instead of considering a local approach, the covariance matrix is computed globally for the deepest feature maps. For that, Section 4 introduces our ensemble learning approach based on a global covariance pooling of CNN features [44].
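The two ensemble ingredients described above, random selection of feature maps and majority voting, can be sketched as follows. The function names are hypothetical, and the per-subset classifier (LE FV encoding followed by a linear SVM) is assumed to be available elsewhere and is not reproduced here.

```python
import numpy as np
from collections import Counter

def sample_feature_subsets(num_maps, subset_size, num_subsets, seed=0):
    """Draw `num_subsets` index sets of `subset_size` feature maps each,
    sampled with replacement as stated in the text."""
    rng = np.random.default_rng(seed)
    return [rng.choice(num_maps, size=subset_size, replace=True)
            for _ in range(num_subsets)]

def majority_vote(decisions):
    """Return the most frequent label among the per-subset decisions.
    Ties are resolved in favor of the label encountered first."""
    return Counter(decisions).most_common(1)[0][0]
```

For example, with N = 7 subsets, `majority_vote` would be applied to the seven labels predicted for one test image, one label per subset-specific classifier.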

Main Motivations and Global Principle
To exploit second-order statistics on the deep convolutional layers of a CNN, He et al. have proposed in [35] a strategy named multilayer stacked covariance pooling (MSCP). Its originality lies in replacing the usual first-order pooling (i.e., average or max pooling) in a CNN with a second-order pooling (i.e., covariance pooling). Note also that, in contrast with the ensemble hybrid LE FV method introduced in Section 3.2, where each layer is represented by a set of covariance matrices computed locally on the feature maps, a single covariance matrix is computed for MSCP, which can significantly reduce the computation time. MSCP has been successfully validated for remote sensing scene classification, but it suffers from two main drawbacks. First, it does not exploit an ensemble learning approach: a single decision is obtained at the end. Second, and probably the main drawback, the averaging operator used before the covariance pooling may lead to an ill-conditioned covariance matrix, since there is no practical reason for the average descriptor obtained on one subset to differ from the one calculated on another subset. To overcome these problems, we have introduced in [44] a novel hybrid approach named ELCP, which consists of an ensemble learning approach based on covariance pooling of CNN features. The global principle is shown in Figure 6. The feature maps of these layers are downsampled to the smallest spatial dimension using bilinear interpolation so that they can be stacked. Furthermore, for each image, an ensemble learning approach is considered where the stacked feature maps generated by the convolutional layers are split into N subsets of k features each. This splitting is achieved by random sampling with replacement. Then, for each subset n, a covariance pooling strategy is adopted, which consists of computing the k × k covariance matrix $\mathbf{C}_n$.
The log-Euclidean framework presented in Section 2 is then adopted to represent $\mathbf{C}_n$ in the tangent space at the identity matrix by $\mathbf{c}_n^{T_{\mathbf{I}_d}}$ according to Equation (4). Then, for each subset, these log-Euclidean vectors are fed to a base linear SVM classifier to obtain a decision. The final prediction is the most frequent decision among the N subsets.
For more details on the sensitivity of ELCP to its input parameters, the interested reader is referred to [44]. Since the classification results for this method are stable and not very sensitive to parameter tuning, the number of subsets N and the number of feature maps k per subset are set to 20 and 170, respectively, in the following, as suggested in [44].
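The per-image feature extraction of ELCP can be sketched as below, assuming the feature maps have already been resized to a common spatial size and stacked. The function name, the ridge term `eps`, and the $\sqrt{2}$ vectorization convention are assumptions; the defaults N = 20 and k = 170 are the values retained in the text.

```python
import numpy as np

def elcp_subset_vectors(stacked_maps, num_subsets=20, subset_size=170,
                        eps=1e-6, seed=0):
    """Global covariance pooling on random subsets of feature maps.

    stacked_maps: (H, W, C) array of stacked feature maps.
    Returns (vectors, subsets): vectors has shape
    (num_subsets, subset_size * (subset_size + 1) / 2).
    """
    H, W, C = stacked_maps.shape
    samples = stacked_maps.reshape(-1, C)          # one observation per pixel
    rng = np.random.default_rng(seed)
    rows, cols = np.triu_indices(subset_size)
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))

    vectors, subsets = [], []
    for _ in range(num_subsets):
        idx = rng.choice(C, size=subset_size, replace=True)
        cov = np.cov(samples[:, idx], rowvar=False) + eps * np.eye(subset_size)
        eigvals, eigvecs = np.linalg.eigh(cov)
        # Clipping guards against tiny eigenvalues caused by duplicated maps.
        log_cov = (eigvecs * np.log(np.clip(eigvals, eps, None))) @ eigvecs.T
        vectors.append(scale * log_cov[rows, cols])
        subsets.append(idx)
    return np.stack(vectors), subsets
```

Each row of `vectors` would then be passed to the linear SVM trained for the corresponding subset, and the N resulting decisions combined by majority vote.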

Experimental Results
This subsection compares the proposed ELCP approach with standard and recent state-of-the-art approaches on the UC Merced dataset, where 10% (p = 10%) of the samples are used for training. The first approach is the FV encoding of handcrafted SIFT features (FV SIFT) [5]. The next approaches are transfer learning methods based on the vgg-vd-16 CNN model pre-trained on the ImageNet dataset. A fine-tuning of this model is first considered (CNN (vgg-vd-16 fine-tuned)): the convolutional layers are frozen, and a fully connected layer is added and trained on the UC Merced dataset. The second transfer learning approach (vgg-vd-16 feat. extraction + SVM) uses the CNN model as a feature extractor, and the CNN features are then fed to an SVM classifier. Finally, the two second-order methods, namely MSCP and the proposed ELCP approach, are compared. Table 4 summarizes the classification results obtained for these five methods.

Table 4. Classification performance of the proposed multi-layer architecture compared to the state-of-the-art on the UC Merced dataset (p = 10%).

| Method | OA (Mean ± sd) |
|---|---|
| FV (SIFT) [5] | 62.3 |

Several conclusions can be drawn from Table 4. First, deep learning-based methods outperform traditional handcrafted ones. Second, since a low number of samples is used for training in this experiment, a fine-tuning strategy does not provide the best results. It is better to consider a pre-trained CNN model as a feature extractor [43,55]. A gain of more than 20% is observed between these strategies. Third, among the transfer learning strategies based on feature extraction, methods exploiting second-order statistics of CNN features (MSCP and ELCP) outperform the first-order one. Fourth, by exploiting an ensemble strategy, the proposed ELCP significantly outperforms MSCP: a gain of about 2% is observed.

Comparison Between Ens. Hybrid LE FV and ELCP Methods
Two transfer learning approaches have been presented, namely Ens. Hybrid LE FV in Section 3 and ELCP in Section 4. There are some similarities between these two methods. Both are based on covariance pooling of CNN features, where the log-Euclidean framework presented in Section 2 is adopted. They also both exploit an ensemble learning approach. The main difference is that second-order statistics of CNN feature maps are computed locally on the first layers for Ens. Hybrid LE FV, while they are computed globally on deeper layers for ELCP. Unsurprisingly, as observed in Tables 3 and 4, ELCP has better classification performance than Ens. Hybrid LE FV since it exploits deeper CNN features. Gains of 26% and 20% are observed for ELCP compared to the first and second layers of Ens. Hybrid LE FV, respectively. However, a closer look at the classification results reveals some images that are correctly classified only by Ens. Hybrid LE FV, whereas ELCP fails. Figure 7 shows some images from the UC Merced dataset with the class predicted by each method. As observed, the first two are correctly classified only by Ens. Hybrid LE FV, while for the last two, only ELCP succeeds. For the first two images, which belong to the baseball diamond class, ELCP seems to focus on the road and building located at the top of the images. Since it exploits deeper layers of a CNN, ELCP learns high-level features that are not so useful for these particular images, for which low-level features are sufficient. On the other hand, the third and fourth images of Figure 7 are correctly classified only by ELCP: since the scene is more complex, high-level features are helpful. It therefore seems natural to combine Ens. Hybrid LE FV and ELCP in order to benefit from both low-level and high-level features.
Based on the principle of the most diverse ensembles, the next subsection presents a simple fusion scheme between these two approaches.

Fusion Scheme
As previously mentioned, the Ens. Hybrid LE FV and ELCP methods can be complementary since they exploit features extracted from different layers. To benefit from both strategies, many multiple classifier systems have been proposed in the literature, such as dynamic selection techniques [56]. However, the goal here is not to provide the best way to combine the Ens. Hybrid LE FV and ELCP methods but rather to show the potential of their fusion. For that, we focus on two standard and straightforward strategies. The first one, denoted Fusion Ens. Hybrid LE FV-ELCP (MV), is simply a majority vote over the decisions obtained at the output of each subset of Ens. Hybrid LE FV and ELCP. The second one, denoted Fusion Ens. Hybrid LE FV-ELCP (MDE+MV), selects the most diverse ensembles (MDE) from these methods according to the disagreement diversity measure and a greedy optimization [53]. A majority vote over the selected ensembles is then performed. Table 5 summarizes the main results obtained on the UC Merced dataset for the original Ens. Hybrid LE FV and ELCP approaches and their fused versions. As observed, since the classification performance is significantly better for ELCP than for Ens. Hybrid LE FV, a simple majority vote is not suitable: the accuracy of this fusion scheme (MV) is strongly degraded by the Ens. Hybrid LE FV decisions. However, by selecting the most diverse ensembles (MDE+MV), a slight gain is observed compared to ELCP, illustrating the potential of the fusion.
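One possible reading of the MDE+MV strategy is sketched below. The pairwise disagreement measure is taken as the fraction of samples on which exactly one of two classifiers is correct. The greedy rule (start from the most diverse pair, then repeatedly add the classifier with the highest mean disagreement with those already selected) and the use of labeled validation predictions are assumptions, since the exact procedure of [53] is not detailed here. All names are hypothetical.

```python
import numpy as np

def disagreement(correct_a, correct_b):
    """Fraction of samples on which exactly one of two classifiers is correct."""
    return float(np.mean(correct_a != correct_b))

def select_most_diverse(predictions, labels, num_selected):
    """Greedy selection of base classifiers (assumed procedure).

    predictions: (M, n) integer labels from M classifiers on n validation samples.
    labels: (n,) ground-truth labels. Requires M >= 2 and num_selected >= 2.
    """
    correct = predictions == labels[None, :]
    M = correct.shape[0]
    num_selected = min(num_selected, M)
    dis = np.array([[disagreement(correct[i], correct[j]) for j in range(M)]
                    for i in range(M)])
    # Seed with the most diverse pair.
    i, j = np.unravel_index(np.argmax(dis), dis.shape)
    selected = [int(i), int(j)]
    while len(selected) < num_selected:
        remaining = [m for m in range(M) if m not in selected]
        gains = [dis[m, selected].mean() for m in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return selected

def majority_vote(predictions):
    """Sample-wise majority vote over non-negative integer labels."""
    return np.array([np.bincount(col).argmax() for col in predictions.T])
```

In this sketch, the rows of `predictions` would gather the decisions of all subsets from both Ens. Hybrid LE FV and ELCP, and `majority_vote` would be applied to the rows returned by `select_most_diverse`.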

Experiments on Other Datasets
In this section, experiments on other remote sensing scene classification datasets are conducted to evaluate the effectiveness of the proposed approach. For that, the SIRI-WHU Google dataset [57], the AID dataset and two real texture datasets, one of maritime pine forests and one of oyster fields [58,59], were tested. In order to assess the proposed approaches in challenging conditions, only 10% of the images were used for training.
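A 10% training protocol of this kind can be sketched as below. Drawing the 10% separately within each class is an assumption made for the illustration (the text only states the overall proportion), and the function name and seed are hypothetical.

```python
import numpy as np

def per_class_split(labels, train_fraction=0.1, seed=0):
    """Randomly keep `train_fraction` of the images of each class for
    training; the remaining images form the test set."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx = []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = max(1, int(round(train_fraction * len(idx))))
        train_idx.extend(idx[:n_train])
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx

# Example with the SIRI-WHU layout: 12 classes of 200 images each.
toy_labels = np.repeat(np.arange(12), 200)
train_idx, test_idx = per_class_split(toy_labels)
```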

SIRI-WHU:
This is a 12-class Google image dataset, where each class contains 200 images of 200 × 200 pixels, with a 2-m spatial resolution. This dataset was acquired from Google Earth and covers urban areas in China. Figure 8 shows some image examples of the dataset.

[Figure 8: sample images from the SIRI-WHU dataset; surviving panel labels: Agricultural, Industrial, Park, Overpass.]

Maritime pine forest:
This dataset comprises four classes of panchromatic Pléiades satellite images with a spatial resolution of 50 cm, acquired to monitor the growth of maritime pine tree stands. Figure 9 illustrates one image from each age class.

Oyster racks:
This five-class dataset is also built from high-resolution panchromatic Pléiades satellite images. It comprises, in particular, images of cultivated oyster racks and abandoned fields. Figure 10 shows one image of each class of the oyster dataset.

[Figure 10: one image per class of the oyster dataset; panel labels: Foreshore, Oyster racks, Disused fields, Sand, Salt-meadow.]

AID:
This dataset contains 10,000 aerial images of dimension 600 × 600 pixels partitioned into 30 classes, with a 2-m spatial resolution. Figure 11 illustrates some dataset images.

Figure 11. Samples from the AID dataset (surviving panel labels: Center, Forest, Stadium, Port).

Table 6 below summarizes the main characteristics of the considered datasets. The experiments aim to validate the proposed fusion of the two ensemble learning approaches, namely the Fusion Ens. Hybrid LE FV-ELCP (MDE+MV) strategy. Table 7 summarizes the main results. As observed, similar conclusions can be drawn for these four datasets. Firstly, the ELCP approach performs better than Ens. Hybrid LE FV applied to the first and second CNN convolutional layers, owing to the depth of the convolutional layers it exploits. This clearly illustrates the interest of exploiting deep feature maps of a CNN model, which capture higher-level features than the first layers. Secondly, as for the UC Merced dataset, the fusion of the local and global second-order statistics computation strategies enhances classification performance, which illustrates the efficiency of the multi-layer fusion.

Conclusions
This paper has introduced a new transfer learning approach based on the covariance pooling of CNN feature maps. The proposed ensemble learning approach consists of the fusion of two hybrid architectures, both of which use features extracted from models pre-trained on the ImageNet dataset. The former exploits low-level features extracted from the first and second layers and consists of the log-Euclidean Fisher vector encoding of region covariance matrices computed locally, while the latter uses high-level features from deeper layers that are pooled together by computing their covariance matrix. These two strategies share many similarities: both are ensemble learning strategies based on the log-Euclidean representation of the covariance matrix of CNN features. However, since they exploit feature maps extracted from different layers, they can be considered complementary. These two ensemble learning strategies were hence combined using the strategy of the most diverse ensembles. The proposed approach was then successfully validated on various datasets for remote sensing scene classification, illustrating its efficiency and the interest of second-order features. Competitive results have been obtained, with a gain of about 1 to 2% in terms of overall accuracy compared to the recent state of the art.
Since the proposed approach is based on covariance pooling of CNN features, any deep convolutional neural network can be used as the backbone. Future work will concern the adaptation of the proposed strategy to multispectral or hyperspectral image datasets, using a CNN designed for this kind of data [60,61].

Funding: This work was financially supported by the "PHC Sakura" program (project number 45095SK), implemented by the French Ministry for Europe and Foreign Affairs, the French Ministry of Higher Education, Research and Innovation and the Japan Society for Promotion of Science. The authors would also like to acknowledge the financial support of Bordeaux Sciences Agro and the Regional Council of Nouvelle Aquitaine, France.