Improving the Separability of Deep Features with Discriminative Convolution Filters for RSI Classification

The extraction of activation vectors (or deep features) from the fully connected layers of a convolutional neural network (CNN) model is widely used for remote sensing image (RSI) representation. In this study, we propose to learn discriminative convolution filters (DCF) based on a class-specific separability criterion for the linear transformation of deep features. In particular, two pretrained CNNs, CaffeNet and VGG-VD16, are used to illustrate the generality of the proposed DCF. The activation vectors extracted from the fully connected layers of a CNN are rearranged into the form of an image matrix, from which a spatial arrangement of local patches is extracted using a sliding window strategy. DCF learning is then performed on each local patch individually to obtain the corresponding discriminative convolution kernel through generalized eigenvalue decomposition. A key property of the proposed DCF learning is that a small convolutional kernel (e.g., 3 × 3 pixels) can be effectively learned on a small local patch (e.g., 8 × 8 pixels), which keeps the computational complexity of the linear transformation low. Experiments on two RSI datasets demonstrate the effectiveness of DCF in improving the classification performance of deep features without increasing dimensionality.


Introduction
In recent years, remote sensing image (RSI) classification has attracted remarkable attention and is becoming increasingly important in a wide range of applications, such as geographic image retrieval, object detection, environment monitoring, and vegetation mapping [1]. Learning robust RSI representations plays an important role in RSI classification because rich geometric structures and spatial patterns exist in RSIs [2]. Over the past few years, numerous methods have been proposed for RSI classification. These methods can generally be grouped into three categories according to feature type [1]: (1) methods based on handcrafted features; (2) methods based on unsupervised feature learning; and (3) methods based on deep learning.
For the first category, representative handcrafted features are color histogram [3], scale-invariant feature transform (SIFT) [4], histograms of oriented gradients (HOG) [5], local binary patterns (LBP) [6,7], Gabor [8], and GIST [9,10], which is an abstract representation of a scene. Yang and Newsam [3] investigated color histogram and SIFT-based bag-of-visual-words (BOW) representations for RSI classification. Cheng et al. [5] proposed to train part detectors by using HOG feature pyramids to extract distinguishable features for RSI representation. Ren et al. [7] proposed LBP structure learning based on incremental maximal conditional mutual information. Risojevic and Babic [8] presented an enhanced Gabor texture descriptor (EGTD) based on cross-correlation within the spatial frequency subbands of Gabor decomposition. In general, methods based on handcrafted features usually extract single-feature cues from images and must be redesigned (e.g., parameter setting) for a new dataset, thereby making RSI classification heavily dependent on expert experience.
For the second category, k-means clustering, sparse coding, and autoencoders are commonly used in unsupervised feature learning methods. As a vector quantization method, k-means clustering aims to partition training samples into k clusters so that samples within a cluster are similar while samples that belong to different clusters are dissimilar. With the use of numerous local descriptors (e.g., SIFT, HOG, and LBP) extracted from a set of training images, k-means is widely used for learning the visual dictionary of BOW-based mid-level representation. Sparse coding [11] aims to learn an overcomplete dictionary from unlabeled samples, ensuring that an image can be efficiently represented through a linear combination of the basis functions in the overcomplete dictionary. Cheriyadat [12] proposed the use of sparse coding to learn a set of basis functions from low-level features (e.g., dense SIFT) for RSI classification. As a neural network method, an autoencoder aims to learn a low-dimensional feature representation from high-dimensional features in an unsupervised manner. Zhou et al. [13] proposed to learn sparse features based on an autoencoder for RSI retrieval. Othman et al. [14] proposed to use convolutional features and a sparse autoencoder for RSI classification. Unlike handcrafted features, unsupervised feature learning can automatically learn meaningful features rather than requiring the best design to be found for a given dataset.
For the third category, deep learning [15] has been intensively used for visual recognition, with the convolutional neural network (CNN) [16] being a popular topic in the deep learning community for automatic learning of visual features. In the past years, numerous CNN architectures, such as AlexNet [16], VGG (Visual Geometry Group)-VD (Very Deep) [17], GoogLeNet [18], and ResNet [19], have been proposed for image recognition (e.g., on the ImageNet dataset [20]). In the field of RSI classification, Zhang et al. [21] proposed a gradient-boosting random convolutional network (GBRCN) framework with the use of RSI training data for land use classification. However, fully designing and training a new CNN architecture in remote sensing applications is difficult because training a CNN model requires a large-scale labeled dataset, which is rarely available in the remote sensing community [22]. Therefore, determining how existing pretrained CNNs (e.g., trained on ImageNet) can be better used is an interesting task for obtaining high performance. Zeiler and Fergus [23] pointed out that the activation vector extracted from the fully connected layer of a CNN can be used as a powerful image descriptor for the feature extraction of other datasets. Penatti et al. [24] evaluated the generalization of deep features using pretrained CNN models such as CaffeNet [25] and OverFeat [26]. Castelluccio et al. [27] demonstrated that a fine-tuned CNN could obtain better classification performance compared with a pretrained CNN. Wan et al. [28] proposed a cascade representation framework for RSI classification based on a set of pretrained CNNs. Recently, Nogueira et al. [22] presented a comprehensive analysis of three possible strategies to investigate the power of existing CNNs. Their results illustrate that using the activation vectors extracted from the fully connected layer of a CNN model as RSI representations, followed by a linear support vector machine (SVM) [29], can yield the best classification performance.
In general, effective feature representation is beneficial for the subsequent stage of classifier training. Kumar et al. [30,31] were the first to use Volterra theory to learn discriminative convolution filters (DCF) from pixel features on gray-level images. The supervised learning process of DCF is based on a class-specific separability criterion, which can be converted into a generalized eigenvalue decomposition. However, the pixel features within an image are strongly correlated with nearby pixels. With the development of deep learning, the deep features extracted by a CNN model can obtain different levels of data abstraction from pixel features. Deep features have less feature redundancy and higher distinguishability than pixel features, thereby making DCF learning on deep features more interesting than on pixel features.
According to the recent review of CNN [22], the best performing deep features for image representation (e.g., RSI) can be obtained from the fully connected layer. Deep features are represented by an activation vector rather than a feature map matrix because of the characteristics of fully connected layers. For example, several widely used CNN architectures, such as AlexNet, CaffeNet, and VGG-VD, contain 4096-dimensional activation vectors in their fully connected layers. DCF was originally proposed to learn spatial kernels on gray-level images with 64 × 64 pixels. Thus, vector-based deep features (activation vectors) cannot be directly used to learn spatial kernels through DCF. Interestingly, a 4096-dimensional activation vector can be reshaped to a matrix with 64 × 64 features through rearrangement, thereby making DCF learning on activation vectors feasible.
On this basis, we explore the first-order (linear) form of DCF learning for the transformation of activation vectors in this study to improve the separability of deep features and the classification performance of RSI without increasing the dimensionality of feature representation. Applying DCF learning to activation vectors is based on the hypothesis that the rearrangement of activation vectors contains discriminant spatial structures, which help increase the probability that a statistical learning model will reveal interesting regularities. In particular, the activation vectors in a CNN model are initially extracted from the fully connected layers. To illustrate the generality of DCF learning for activation vectors, two types of pretrained CNNs, namely, CaffeNet [25] and VGG-VD16 [17], are used for comprehensive performance comparisons. Then, the activation vector is rearranged into the form of an image matrix, from which local patches are extracted using the sliding window strategy. Finally, DCF learning is performed on each local patch individually through a generalized eigenvalue problem. The advantage of the proposed method is that small DCF kernels (e.g., 3 × 3 pixels) can be effectively learned on small local patches (e.g., 8 × 8 pixels). Therefore, the transformation (linear convolution) of the activation vectors can maintain low computational complexity. The effectiveness of DCF transformation for activation vectors is further evaluated by supervised classification with linear SVM, which is widely used in CNN-based RSI classification [1,2,22,24,28]. Experiments on two publicly available RSI datasets demonstrate that DCF helps improve the classification performance of the activation vectors obtained by CaffeNet or VGG-VD16.

Proposed Method
As shown in Figure 1, the proposed method consists of the following stages: (1) extracting a 4096-dimensional activation vector from the fully connected layer of a pretrained CNN, followed by L2 normalization; (2) rearranging the activation vector into the form of an image matrix, from which small local patches are extracted by using the sliding window strategy with a fixed stride, followed by DCF learning on each local patch to obtain a convolutional filter; and (3) generating the linear transformation of the deep features on the basis of the learned filters.

Deep Features
Pretrained CaffeNet and VGG-VD (obtained by MatConvNet [32]) are selected to extract deep features and to demonstrate the generality of the proposed DCF for the transformation of different types of deep features. We select pretrained CNNs to extract deep features because a CNN model learned on a large-scale dataset generalizes well to other tasks (e.g., RSI classification) and no retraining process is needed for the target application. Previous works [22,24] indicated that the deep features, which are the so-called activation vectors in this study, can be obtained from the last fully connected layers (except for the classification layer). Thus, this strategy is employed in our deep feature extraction.
CaffeNet consists of five convolutional layers and three fully connected layers. The convolutional layers contain a linear convolution followed by one or more nonlinear operations, such as rectified linear units (ReLU), local response normalization, and max pooling. The input image size is 227 × 227 pixels with three channels (red-green-blue). The first convolutional layer contains 96 kernels (receptive fields or filters) with a size of 11 × 11 × 3 pixels. The second convolutional layer contains 256 kernels with a size of 5 × 5 × 48 pixels. The third convolutional layer contains 384 kernels with a size of 3 × 3 × 256 pixels. The fourth convolutional layer contains 384 kernels with a size of 3 × 3 × 192 pixels. The fifth convolutional layer contains 256 kernels with a size of 3 × 3 × 192 pixels. Each fully connected layer (except for the last classification layer) contains 4096 neurons. In this study, we extract 4096-dimensional activation vectors from the first and the second fully connected layers.
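As a quick arithmetic check on the kernel shapes listed above, the following sketch counts the weights in each CaffeNet convolutional layer. The layer names are labels chosen here for illustration and bias terms are ignored.

```python
# Kernel counts and shapes (height, width, input depth) as listed in the text.
# The layer names are illustrative labels, not identifiers from any framework.
CAFFENET_CONV = {
    "conv1": (96, (11, 11, 3)),
    "conv2": (256, (5, 5, 48)),
    "conv3": (384, (3, 3, 256)),
    "conv4": (384, (3, 3, 192)),
    "conv5": (256, (3, 3, 192)),
}


def conv_weight_counts(layers):
    """Return {layer name: number of kernel weights}, ignoring bias terms."""
    counts = {}
    for name, (num_kernels, (height, width, depth)) in layers.items():
        counts[name] = num_kernels * height * width * depth
    return counts


counts = conv_weight_counts(CAFFENET_CONV)
total = sum(counts.values())
```

The input depths of 48 and 192 (rather than 96 and 384) arise because those layers are split into two groups, as in AlexNet.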
In addition, data augmentation (Figure 2) is performed by sampling sub-images from the original input image and averaging the activation vectors of these sub-images, similar to the prevalent "center + corners with horizontal flips" augmentation [16,28,33]. First, the original input image is resized to 256 × 256 pixels. Then, five sub-images (corresponding to the center and four corners) and their horizontal flips are cropped from the resized image. Finally, each sub-image is used to extract two 4096-dimensional activation vectors from two fully connected layers. The final representation of the original input image can be obtained by averaging the 4096-dimensional activation vectors over the 20 sub-images, followed by L2 normalization.
VGG-VD16 consists of 13 convolutional layers, five pooling layers, and three fully connected layers. The input image size is 224 × 224 pixels with three channels (red-green-blue). The kernel size of each convolutional layer is 3 × 3 pixels, which is the smallest receptive field size. The convolution stride is set to 1 pixel, and the size of the feature maps is preserved after convolution by spatial padding (1 pixel). Max pooling is used in the five pooling layers, which follow some of the convolutional layers (not all the convolutional layers are followed by pooling). The size of a pooling region is set to 2 × 2 pixels with a stride of 2. Of the three fully connected layers, the first two contain 4096 neurons each; the third performs 1000-way ILSVRC classification (corresponding to 1000 neurons). Similar to the feature extraction of CaffeNet, we extract two 4096-dimensional activation vectors from the first and second fully connected layers of VGG-VD16, followed by data augmentation and L2 normalization.
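The "center + corners with horizontal flips" sampling used for both networks can be sketched as follows. This is a minimal illustration with NumPy, not the authors' MatConvNet code: the function names are hypothetical, and random arrays stand in for the image and for the CNN activations.

```python
import numpy as np


def ten_crop(image, crop):
    """Return the four corner crops, the center crop, and their horizontal flips.

    image: array of shape (H, W, C); crop: side length of the square crops.
    """
    h, w = image.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop), (top, left)]
    crops = [image[y:y + crop, x:x + crop] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips of the five crops
    return np.stack(crops)


def averaged_descriptor(vectors):
    """Average a stack of activation vectors, then L2-normalize the result."""
    mean = np.mean(vectors, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean


rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))          # stand-in for a resized input image
crops = ten_crop(image, 227)               # 227 matches the CaffeNet input size
# Placeholder activations: two fully connected layers per crop, as in the text.
fake_activations = rng.random((2 * len(crops), 4096))
descriptor = averaged_descriptor(fake_activations)
```

In a real pipeline each crop would be passed through the pretrained CNN in place of the random placeholder; for VGG-VD16 the crop size would be 224.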

Supervised DCF Learning
Given a 4096-dimensional activation vector extracted from a pretrained CNN (CaffeNet or VGG-VD16), we can rearrange it into the form of an image matrix in row-priority or column-priority order, thereby resulting in a feature map with 64 × 64 pixels. The purpose of this rearrangement is to allow DCF kernels with spatial arrangement characteristics to be learned.
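The rearrangement can be illustrated with a short NumPy sketch. The function name `to_feature_map` is hypothetical; `order="C"` corresponds to row priority and `order="F"` to column priority, and the L2 normalization from the first stage of the method is included for convenience.

```python
import numpy as np


def to_feature_map(activation, order="C"):
    """L2-normalize a 4096-D activation vector and reshape it to 64 x 64.

    order="C" fills the matrix row by row; order="F" fills it column by column.
    """
    v = np.asarray(activation, dtype=float)
    if v.size != 4096:
        raise ValueError("expected a 4096-dimensional activation vector")
    norm = np.linalg.norm(v)
    if norm > 0:
        v = v / norm
    return v.reshape(64, 64, order=order)


vector = np.arange(4096.0)                 # stand-in for a real activation vector
map_rows = to_feature_map(vector, "C")
map_cols = to_feature_map(vector, "F")
```

The two orderings produce transposed feature maps, so either can be used as long as the same choice is applied to training and testing features.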
To learn the DCF kernels, the 64 × 64-pixel feature map is first divided into a spatial arrangement of local patches through a sliding window strategy. Given a set of local patches $X = \{x_1, x_2, \cdots, x_N\}$ extracted from the same spatial location across a set of training feature maps (training images), each local patch $x_i$ with r × r pixels belongs to a specific class of $C = \{c_1, c_2, \cdots, c_K\}$. Through an unknown function $f$, these local patches can be mapped into other representations that minimize an objective function defined on L2 distances. The objective function can be defined as follows:

$$O(X) = \frac{\sum_{c_k \in C} \sum_{x_i, x_j \in c_k} \| f(x_i) - f(x_j) \|^2}{\sum_{c_k \in C} \sum_{x_i \in c_k,\, x_j \notin c_k} \| f(x_i) - f(x_j) \|^2} \quad (1)$$

where the numerator measures the within-class distance and the denominator measures the between-class distance. Here, we seek a linear transformation (linear filter) that maps these patches to a new representation such that the within-class L2 distance is minimized while the between-class L2 distance is maximized. Thus, Equation (1) can be described as

$$O(X) = \frac{\sum_{c_k \in C} \sum_{x_i, x_j \in c_k} \| x_i \otimes K - x_j \otimes K \|^2}{\sum_{c_k \in C} \sum_{x_i \in c_k,\, x_j \notin c_k} \| x_i \otimes K - x_j \otimes K \|^2} \quad (2)$$

where ⊗ is the convolution operator, and K is the DCF kernel that we need to learn.
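To learn K in Equation (2), we need to express K in vector form. Thus, we transform $x_i$ into a new representation $A_i$ (Figure 3), such that

$$x_i \otimes K = A_i \bar{K},$$

where $\bar{K}$ is the vectorized form of K. For a local patch $x_i$ with r × r pixels and a filter K with w × w pixels, the transformed matrix $A_i$ with $r^2 \times w^2$ dimensions can be constructed by vectorizing the w × w neighborhood of each pixel in $x_i$, as shown in Figure 3. Thus, we can obtain the following equation by substituting this representation into Equation (2):

$$O(X) = \frac{\sum_{c_k \in C} \sum_{x_i, x_j \in c_k} \| A_i \bar{K} - A_j \bar{K} \|^2}{\sum_{c_k \in C} \sum_{x_i \in c_k,\, x_j \notin c_k} \| A_i \bar{K} - A_j \bar{K} \|^2} \quad (3)$$

Equation (3) can be written as

$$O(X) = \frac{\bar{K}^T S_W \bar{K}}{\bar{K}^T S_B \bar{K}}$$

and

$$S_W = \sum_{c_k \in C} \sum_{x_i, x_j \in c_k} (A_i - A_j)^T (A_i - A_j), \qquad S_B = \sum_{c_k \in C} \sum_{x_i \in c_k,\, x_j \notin c_k} (A_i - A_j)^T (A_i - A_j).$$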
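The learning step for one spatial location can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the small ridge term `eps` added to the between-class matrix for numerical stability, and the Cholesky-based solver are all choices made here. The neighborhoods are zero padded and the kernel is applied without flipping, which differs from strict convolution only by a relabeling of the kernel entries.

```python
import numpy as np


def patch_to_matrix(x, w):
    """Build A (r*r x w*w): each row is the zero-padded w x w neighborhood of one pixel."""
    r = x.shape[0]
    xp = np.pad(x, w // 2)  # constant (zero) padding by default
    rows = [xp[i:i + w, j:j + w].ravel() for i in range(r) for j in range(r)]
    return np.array(rows)


def learn_dcf(patches, labels, w=3, eps=1e-6):
    """Return (kernel, ratio): the w x w kernel with the smallest within/between ratio."""
    mats = [patch_to_matrix(x, w) for x in patches]
    s_w = np.zeros((w * w, w * w))
    s_b = np.zeros((w * w, w * w))
    for i in range(len(mats)):
        for j in range(i + 1, len(mats)):
            d = mats[i] - mats[j]
            if labels[i] == labels[j]:
                s_w += d.T @ d      # within-class scatter
            else:
                s_b += d.T @ d      # between-class scatter
    s_b += eps * np.eye(w * w)      # assumption: ridge term keeps s_b positive definite
    # Solve s_w k = lambda s_b k by whitening with the Cholesky factor of s_b.
    chol_inv = np.linalg.inv(np.linalg.cholesky(s_b))
    m = chol_inv @ s_w @ chol_inv.T
    vals, vecs = np.linalg.eigh((m + m.T) / 2)   # eigenvalues in ascending order
    k = chol_inv.T @ vecs[:, 0]                  # eigenvector of the smallest eigenvalue
    return (k / np.linalg.norm(k)).reshape(w, w), vals[0]


rng = np.random.default_rng(0)
patches = rng.random((12, 8, 8))    # synthetic 8 x 8 patches from one location
labels = [0] * 6 + [1] * 6          # two synthetic classes
kernel, ratio = learn_dcf(patches, labels, w=3)
```

Because only a $w^2 \times w^2$ eigenproblem is solved per location (9 × 9 for a 3 × 3 kernel), the cost of learning is dominated by accumulating the pairwise scatter matrices.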

Transformation of Deep Features with DCF
Given a 64 × 64-pixel feature map (deep features) and a set of learned DCF kernels, the first step is to divide the feature map into equal-sized local patches (r × r pixels). We allow local patches to overlap with a sliding stride of s pixels, resulting in a total of $\left(\frac{64-r}{s}+1\right) \times \left(\frac{64-r}{s}+1\right)$ local patches. Correspondingly, $\left(\frac{64-r}{s}+1\right) \times \left(\frac{64-r}{s}+1\right)$ DCF kernels can be learned individually, according to Section 2.2.
To obtain the new representation, each of the $\left(\frac{64-r}{s}+1\right) \times \left(\frac{64-r}{s}+1\right)$ local patches is convolved with its corresponding DCF kernel, and the convolutional results are concatenated, as shown in Figure 1. The convolution of each local patch is independent. During the convolution, the border pixels of each local patch are padded with zeros, thereby resulting in a D-dimensional new representation, where $D = \left(\frac{64-r}{s}+1\right) \times \left(\frac{64-r}{s}+1\right) \times r \times r$.
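The patch-wise transformation described above can be sketched as follows. The function names and the dictionary layout of the kernels are hypothetical, and the filtering is written as an explicit zero-padded sliding window so that the block depends only on NumPy.

```python
import numpy as np


def filter_same(x, k):
    """Zero-padded sliding-window filtering; the output has the same size as x."""
    w = k.shape[0]
    xp = np.pad(x, w // 2)
    out = np.empty(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + w, j:j + w] * k)
    return out


def dcf_transform(feature_map, kernels, r=8, s=8):
    """Filter each r x r patch with its own kernel and concatenate the results.

    kernels[(a, b)] is the kernel learned for the patch at grid position (a, b).
    """
    per_side = (feature_map.shape[0] - r) // s + 1
    parts = []
    for a in range(per_side):
        for b in range(per_side):
            patch = feature_map[a * s:a * s + r, b * s:b * s + r]
            parts.append(filter_same(patch, kernels[(a, b)]).ravel())
    return np.concatenate(parts)


rng = np.random.default_rng(0)
feature_map = rng.random((64, 64))
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                # a kernel that leaves each patch unchanged
kernels = {(a, b): identity for a in range(8) for b in range(8)}
transformed = dcf_transform(feature_map, kernels, r=8, s=8)
```

With r = 8 and s = 8 the patches do not overlap, so the output has 64 × 64 = 4096 entries, the same dimensionality as the input activation vector.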

Experiments and Discussion
Experiments

Experimental Configurations
Given a feature map with 64 × 64 pixels (reshaped from 4096-dimensional deep features), an 8 × 8-pixel local patch, which performs best for the 64 × 64-pixel feature map, is used to extract the spatially arranged local patches. DCF learning is then performed on each spatial location individually. After applying the learned DCF to a given feature map, the dimensionality of the final representation is determined by the sampling (sliding window) stride (s), which is analyzed in subsequent experiments.
For both the 21-class and 19-class datasets, each experiment is repeated 10 times to report the average classification accuracy (denoted by mean) and standard deviation (denoted by std). In each round of testing, a fixed number of training images are randomly selected from each class and a linear SVM [29] is trained on them; the overall accuracy on the remaining images (the so-called testing images) is used for evaluation. To obtain the overall accuracy, we count the number of correctly classified images in a set of $k$ testing images. A classification is correct if an image that belongs to the $c_i$th class is assigned to the $c_i$th class by the SVM prediction. Suppose that $k'$ testing images are classified correctly. Then, the overall accuracy can be computed by $100 \times k'/k$. For the 10 rounds of testing, 10 overall accuracies ($a_1, a_2, \ldots, a_{10}$) can be obtained. The average classification accuracy is $\text{mean} = \frac{1}{10}\sum_{i=1}^{10} a_i$, and std is the standard deviation of the 10 accuracies about this mean.
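The evaluation protocol can be expressed in a few lines of standard-library Python. The function names are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the normalization is not recoverable from the text.

```python
from statistics import mean, stdev


def overall_accuracy(true_labels, predicted_labels):
    """Percentage of testing images whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


def summarize(accuracies):
    """Return (mean, std) over the rounds; std here is the sample standard deviation."""
    return mean(accuracies), stdev(accuracies)


# Toy example: one round with four testing images, then three illustrative rounds.
round_accuracy = overall_accuracy([0, 1, 2, 2], [0, 1, 1, 2])
avg, sd = summarize([90.0, 92.0, 94.0])
```

Replacing `stdev` with `statistics.pstdev` gives the population form if that convention is preferred.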

DCF Kernel Size and Sampling Stride
Given the use of an 8 × 8-pixel local patch, Figure 5 compares the effects of different filter sizes and sampling strides on the classification performance for both RSI datasets (using 10% training images per class). On the one hand, given a feature map with 64 × 64 pixels and a local patch with 8 × 8 pixels, a DCF kernel size of 3 × 3 pixels performs best on both datasets. Kernels larger than 3 × 3 pixels overfit a local patch with 8 × 8 pixels. For example, the comparison between "filter size @ 3 × 3" and "filter size @ 5 × 5" indicates decreased classification accuracy with the increase in DCF kernel size. On the other hand, the classification performances with respect to three sampling strides ("stride @ 4," "stride @ 8," and "stride @ 12") are compared. A large sampling stride corresponds to a low-dimensional final image representation. Given that the local patch size is 8 × 8 pixels, s = 8 indicates that the local patches are extracted without overlap. Thus, the DCF transformation yields 4096-dimensional features, the same dimensionality as the original deep features. Although the dimensionality of the final image representation decreases as s increases, the classification accuracy also tends to decrease. In addition, s = 4 shows no accuracy advantage over s = 8 on either dataset. Compared with s = 8 (a total of 64 filters), s = 4 would clearly increase the total number of DCF kernels that need to be learned, thereby increasing the burden of DCF training. In general, s = 8 is a good choice, considering both computational efficiency and classification performance.
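The trade-off between stride, number of kernels, and dimensionality follows directly from the patch-count formula and can be tabulated with a small helper. The function name is hypothetical, and integer (floor) division is an assumption made here for strides that do not tile the feature map evenly.

```python
def dcf_layout(r, s, size=64):
    """Return (number of kernels, output dimensionality D) for patch size r and stride s."""
    per_side = (size - r) // s + 1   # assumption: floor when the stride does not fit evenly
    num_kernels = per_side * per_side
    return num_kernels, num_kernels * r * r


layout_s4 = dcf_layout(8, 4)
layout_s8 = dcf_layout(8, 8)
```

For 8 × 8 patches, s = 8 gives 64 kernels and 4096 dimensions, whereas s = 4 gives 225 kernels and 14,400 dimensions, which is the additional training burden noted above.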
With the use of s = 8, Figure 6

Effectiveness of DCF for Deep Features
With the use of s = 8, Figure 7 compares the proposed method with and without DCF to illustrate the improvements that DCF brings to deep features. The "CaffeNet" and "VGG-VD16" entries in the legend of Figure 7 are the baselines without DCF. With an increase in the number of training images per class, the accuracy improvement of DCF decreases. To show that the improvement from DCF is statistically significant under different numbers of training images, a statistical significance testing method called Wilcoxon's signed-rank test [35] is used. Given two methods (e.g., deep features with and without DCF), Wilcoxon's signed-rank test analyzes the paired classification accuracies for the 10 rounds of testing. If DCF yields a significant accuracy improvement, then most results obtained by the deep features with DCF will be greater than those obtained by the deep features without DCF, and those not greater will be smaller by only a small amount.
Wilcoxon's signed-rank test outputs a probability value P, which is the probability of observing an effect at least as large as the measured one given that the null hypothesis [35] is true. Wilcoxon's signed-rank test analyzes whether the null hypothesis should be rejected. The observed result is statistically significant if the null hypothesis is rejected. In particular, the null hypothesis can be rejected if P is less than a pre-defined significance level, which is usually set to 0.05 or 0.01. As shown in Figure 7, all P values are smaller than 0.05, and these findings indicate that the difference between the deep features with and without DCF is significant even where the accuracy improvements are small. In general, the deep features obtained by CaffeNet or VGG-VD16 with DCF are significantly better than those obtained without DCF.
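For ten paired accuracies the test is small enough to compute exactly by enumerating all sign assignments. The sketch below is a self-contained illustration of the two-sided test, not the routine used in the paper; the function name is hypothetical, zero differences are dropped, and tied magnitudes receive average ranks.

```python
from itertools import product


def signed_rank_p_value(a, b):
    """Exact two-sided Wilcoxon signed-rank P value for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # assign average ranks to ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=n):           # all 2**n sign assignments
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= observed:
            hits += 1
    return hits / 2 ** n


# Illustrative accuracies only: the "with DCF" run wins in all ten rounds.
without_dcf = [90.0] * 10
with_dcf = [91.0 + i for i in range(10)]
p_value = signed_rank_p_value(with_dcf, without_dcf)
```

When one method wins all ten rounds, only two of the 1024 sign assignments are as extreme, so P = 2/1024 ≈ 0.002, which is below both common significance levels.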
With the use of 10% training images (10 images per class for 21-class and five images per class for 19-class), we further compare the classification performance of each class between deep features with and without DCF transformation, as shown in Figure 8. Overall, DCF helps improve the classification performance of those categories that contain buildings or significant features (e.g., objects or textures). For the 21-class dataset, CaffeNet-based deep features with DCF show clear accuracy advantages on "golf course," "medium density residential," "overpass," "river," and "storage tanks" compared with those without DCF. VGG-VD16-based deep features with DCF show clear accuracy advantages on "density residential," "intersection," "river," "runway," and "storage tanks" compared with those without DCF. For the 19-class dataset, CaffeNet-based deep features with DCF show clear accuracy advantages on "commercial," "forest," and "industrial" compared with those without DCF, and VGG-VD16 with DCF shows clear accuracy advantages on "airport," "industrial," "mountain," "pond," and "residential" compared with that without DCF.

Analysis of Confusion Matrix
Figure 9 shows two confusion matrices for the 21-class dataset. For Figure 9a,b, "agricultural," "airplane," "beach," "chaparral," "forest," "golf course," "harbor," "parking lot," and "river" achieve high classification accuracies. The well-performing classes have different characteristics. For example, images in "agricultural" and "forest" have significant textures; images in "airplane" contain prominent aircraft that are easy to distinguish from other objects (e.g., buildings); images in "beach" show significant color features; and images in "harbor" or "parking lot" have significant spatial and texture structures. In contrast, some classes perform poorly, such as "dense residential," "medium density residential," "sparse residential," and "tennis courts." These classes have similar characteristics, such as the presence of various buildings that often look alike across different classes. "Tennis courts" performs poorly because the tennis courts are surrounded by buildings and are generally unremarkable.
In general, the overall classification performances of both CNNs are similar, but the accuracies on several classes (e.g., "storage tanks") differ between CaffeNet and VGG-VD16. Although VGG-VD16 is much deeper than CaffeNet in the network architecture, the overall performance of the former shows no advantage over the latter due to the accuracy saturation of this dataset.
Similar to Figure 9, Figure 10 shows two confusion matrices for the 19-class dataset. For Figure 10a,b, "airport," "beach," "desert," "football field," "meadow," "pond," "river," and "viaduct" show high classification performance. Similar to the 21-class dataset, the well-performing classes in the 19-class dataset contain significant textures, colors, objects, or spatial structures. "Commercial" performs poorly in Figure 10a,b because numerous buildings exist in this class. In general, the two types of pretrained CNN show classification performances similar to those in Figure 9.
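For readers who wish to reproduce this kind of per-class analysis, a confusion matrix and its per-class accuracies can be computed as in the following standard-library sketch. The function names are hypothetical, and the convention that rows index the true class and columns the predicted class is an assumption about the figure layout.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count predictions: rows index the true class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Percentage of each class's testing images that were classified correctly."""
    return [100.0 * row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Toy labels for three classes; real use would pass the SVM predictions.
truth = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(truth, predicted, 3)
accuracies = per_class_accuracy(matrix)
```

Off-diagonal entries show which pairs of classes are confused, which is how mix-ups among the residential classes would appear.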

Comparisons with Other Methods
Table 1 summarizes the performance comparisons of different methods using different training ratios. Among these comparison methods, EGTD [8] and multiple kernel learning (MKL) [6] are based on handcrafted features. EGTD computes the means and standard deviations of Gabor coefficients and the cross-correlation between these coefficients at different scales or orientations. MKL can determine a suitable combination of a set of handcrafted features automatically. fDNF + FV (fusion Divisive Normalization Features with Fisher Vector) [36] is a mid-level representation method based on local description and Fisher encoding, and the local descriptions are obtained by handcrafted features. Unsupervised feature learning (UFL) [12] consists of low-level feature extraction, feature learning, encoding, and pooling. UFL obtains sparse feature representations by encoding the low-level features with a set of learned basis functions, which are generated by unsupervised learning. GBRCN [21], LPCNN (Large Patch Convolutional Neural Networks) [37], CaffeNet [2], and VGG-VD16 [2] are four types of CNN-based methods. LPCNN investigated an appropriate model to balance the trade-off between a CNN and the limited number of training images. GBRCN can effectively combine numerous deep neural networks. CaffeNet and VGG-VD16 use pretrained models (trained on ImageNet) to extract 4096-dimensional deep features from images, followed by SVM training and classification [2,22,24]. In Table 1, the results for EGTD, MKL, fDNF + FV, UFL, GBRCN, and LPCNN are obtained from the original references; the empty results (denoted by "-") indicate that the corresponding reference does not provide the results. CaffeNet, VGG-VD16, Proposed (CaffeNet with DCF), and Proposed (VGG-VD16 with DCF) are implemented using MatConvNet [32]. All methods are trained on a fixed number of images per class and tested on the remaining images to obtain the overall classification performance.
Several conclusions can be drawn from Table 1. First, local feature representation followed by feature encoding performs well among the methods based on handcrafted features. EGTD cannot encode local information because the wavelet coefficients are averaged over the image domain. Second, although UFL can learn features from images automatically, it is an unsupervised learning method that, unlike supervised CNNs, cannot learn class-specific separable features. Feature encoding based on low-level features or UFL can only generate shallow mid-level features with limited representative ability, which essentially prevents these methods from achieving desirable performance. Third, a CNN can obtain different levels of abstraction from the input image, ranging from low-level features in the initial layers and mid-level features in the intermediate layers to high-level features in the final layers.
The training samples play an important role in obtaining an effective CNN model. Among these comparison methods, GBRCN and LPCNN are trained on the 21-class dataset, which contains only 2100 images. By contrast, CaffeNet and VGG-VD16 are trained on the ImageNet dataset with millions of images. Compared with GBRCN and LPCNN, the generality of both CaffeNet and VGG-VD16 is obvious. Finally, the comparison of CaffeNet and "Proposed (CaffeNet-DCF)," and the comparison of VGG-VD16 and "Proposed (VGG-VD16-DCF)," indicate that DCF can improve the classification accuracy on both datasets, especially with fewer training images.

Conclusions
In this study, we propose a novel method for RSI representation based on deep features and DCF kernels to improve the separability of the deep features extracted from CNN models. Given a pretrained CNN model, the deep features are represented by activation vectors extracted from fully connected layers. Then, the deep features are rearranged into the form of an image matrix to obtain a spatial arrangement of local patches. Finally, supervised DCF learning, which helps enhance the distinguishability of activation vectors, is performed on each spatial location individually to learn the corresponding DCF kernel. Experiments on two RSI datasets illustrate the effectiveness of the DCF in improving classification accuracies. In future work, we intend to investigate DCF learning for multiple CNNs simultaneously.

CaffeNet
In 2012, Krizhevsky et al. [16] proposed a deep CNN architecture, the so-called AlexNet, for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which achieved great success in the ILSVRC 2012 competition. AlexNet is a breakthrough in image recognition because (1) nonsaturating neurons are used, (2) the dropout technique is introduced to prevent overfitting, and (3) a GPU implementation is used to accelerate learning. As a reference model in the Caffe open source framework [25], CaffeNet is nearly a replication of AlexNet. Unlike AlexNet, CaffeNet has no data augmentation in the training stage, and the order of the normalization and pooling operations in the CNN architecture is exchanged.

Figure 1. Framework of the proposed method based on deep features and DCF (Discriminative Convolution Filter). The top row contains the learning stages of pretrained CNN (Convolutional Neural Network) and DCF. The bottom explains the extraction of deep features and their convolutional transformation based on DCF kernels.

Figure 2. Data augmentation based on "center + corners with horizontal flips" strategy.
$S_W$ and $S_B$ are symmetric matrices of size $w^2 \times w^2$. The minimum of Equation (3) can be obtained by solving the generalized eigenvalue problem. Thus, the minimum of Equation (3) is given by the minimum eigenvalue of $S_B^{-1} S_W$, and the optimal vectorized kernel $\bar{K}_{opt}$ equals the corresponding eigenvector (which can be reshaped to a matrix form).

Figure 3. Transformed matrix $A_i$ for a local patch with 8 × 8 pixels and a DCF kernel K with 3 × 3 pixels. In the first row, nine neighborhoods of the local patch are highlighted. The nine neighborhoods are concatenated to form a row of $A_i$.
Experiments are conducted on two publicly available datasets (Figure 4), namely, the 21-class land use dataset (denoted by 21-class) and the 19-class satellite scene dataset (denoted by 19-class), to illustrate the effectiveness of the proposed DCF in improving the classification performance of deep features.
Figure 6 shows the differences between deep features (extracted by CaffeNet or VGG-VD16) with and without DCF. Given an input image, the 4096-dimensional deep features, which are non-negative, contain a large number of zero values (black pixels) because a ReLU operation is performed during CNN extraction. By contrast, DCF produces a large number of non-zero values (both positive and negative) from the deep features because, after rearrangement, the DCF kernel summarizes adjacent deep features at each spatial location through linear convolution. In general, DCF extends the representation range of deep features from non-negative values to real numbers.
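The effect described above can be reproduced with a small synthetic sketch. It assumes that the 4096-dimensional vector is rearranged row by row into a 64 × 64 matrix and tiled into non-overlapping 8 × 8 patches, and it uses random values as stand-ins for both the ReLU activations and the learned kernels; none of these choices is confirmed by the text.

```python
import numpy as np

def filter_patch(patch, kernel):
    """Apply a small kernel to a patch with zero padding (hypothetical helper)."""
    w = kernel.shape[0]
    r = w // 2
    padded = np.pad(patch, r, mode="constant")
    out = np.empty_like(patch)
    for i in range(patch.shape[0]):
        for j in range(patch.shape[1]):
            out[i, j] = np.sum(padded[i:i + w, j:j + w] * kernel)
    return out

rng = np.random.default_rng(0)
# Stand-in for a ReLU-activated deep feature: non-negative with many zeros.
feat = np.maximum(rng.standard_normal(4096), 0.0)
img = feat.reshape(64, 64)               # assumption: row-major 64 x 64 layout

s, w = 8, 3                              # patch size and kernel size
transformed = np.empty_like(img)
for top in range(0, 64, s):
    for left in range(0, 64, s):
        # Random stand-in for the kernel learned on this patch.
        kernel = rng.standard_normal((w, w))
        block = img[top:top + s, left:left + s]
        transformed[top:top + s, left:left + s] = filter_patch(block, kernel)

new_feat = transformed.ravel()           # still 4096-dimensional
zero_before = np.mean(feat == 0)         # fraction of zeros before filtering
zero_after = np.mean(new_feat == 0)      # fraction of zeros after filtering
```

In this sketch the output keeps the input dimensionality, takes both signs, and contains far fewer exact zeros than the input.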

Figure 6. Differences between deep features with and without DCF.
illustrates that the deep features (without DCF transformation) are directly used for SVM training and classification. In general, the following conclusions can be drawn: (1) DCF helps improve the classification performance of both pretrained CNNs on the two RSI datasets. (2) With the use of DCF, accuracy improvements are obtained for different numbers of training images, especially when fewer training images are available. (3) Selecting 10 training images per class in the 21-class dataset or five training images per class in the 19-class dataset yields larger improvements than the other training ratios because the classification accuracies are close to saturation when a large number of training images is used (e.g., 80 training images per class in the 21-class dataset or 40 training images per class in the 19-class dataset). Overall, Figure 7 indicates that patch-based DCF learning can substantially improve the separability of deep features without increasing dimensionality (e.g., s = 8).

Figure 7. Effectiveness of DCF for deep features. P is the probability value obtained by Wilcoxon's signed-rank test. (a) CaffeNet-based deep features on the 21-class dataset; (b) VGG-VD16-based deep features on the 21-class dataset; (c) CaffeNet-based deep features on the 19-class dataset; (d) VGG-VD16-based deep features on the 19-class dataset.

Figure 8 compares the classification performance of each class between deep features with and without DCF and identifies which classes can be improved by DCF transformation. However, the factors that affect the classification performance of each class are not clear. In this section, four confusion matrices (CaffeNet with DCF and VGG-VD16 with DCF on the two datasets) are provided to determine the classes that are easily confused, thereby helping explain the effects of other classes on a given class, as shown in Figures 9 and 10. Unlike the results in Figure 8, which are based on selecting 10% of the images per class for training and testing on the remaining 90% of each dataset, the four confusion matrices in Figures 9 and 10 are based on selecting 80% of the images per class for training and testing on the remaining 20%. Figure 9 shows two confusion matrices for the 21-class dataset. In Figure 9a,b, "agricultural," "airplane," "beach," "chaparral," "forest," "golf course," "harbor," "parking lot," and "river" achieve high classification accuracies. The well-performing classes have different characteristics. For example, images in "agricultural" and "forest" have distinctive textures; images in "airplane" contain prominent aircraft that are easy to distinguish from other objects (e.g., buildings); images in "beach" show distinctive color features; and images in "harbor" or "parking lot" have distinctive spatial and texture structures. In contrast, some classes perform poorly, such as "dense residential," "medium density residential," "sparse residential," and "tennis courts." These classes have similar characteristics, such as the presence of various buildings that often look alike across different classes. "Tennis courts" performs poorly because the tennis courts are surrounded by buildings and are generally unremarkable. In general, the overall classification performances of both CNNs are similar, but the accuracies on several classes (e.g., "storage tanks") differ between CaffeNet and VGG-VD16. Although VGG-VD16 is much deeper than CaffeNet in the network architecture, the overall

Figure 8. Comparisons of the classification performance for each class between deep features with and without DCF using 10% training images per class. (a) CaffeNet-based deep features on the 21-class dataset; (b) VGG-VD16-based deep features on the 21-class dataset; (c) CaffeNet-based deep features on the 19-class dataset; (d) VGG-VD16-based deep features on the 19-class dataset.

Figure 9. Confusion matrices for the 21-class dataset in the case of 80 training images per class. All results are given as percentages, and the rows and columns represent the ground truth and classification accuracies, respectively. (a) CaffeNet-based deep features with DCF; (b) VGG-VD16-based deep features with DCF.

Table 1. Comparisons of classification accuracy (%) with other methods.