High-Resolution Remote Sensing Image Retrieval Based on CNNs from a Dimensional Perspective

Because of recent advances in Convolutional Neural Networks (CNNs), traditional CNNs have been employed to extract thousands of codes as feature representations for image retrieval. In this paper, we propose that more powerful features for high-resolution remote sensing image representations can be learned using only several tens of codes; this approach can improve the retrieval accuracy and decrease the time and storage requirements. To accomplish this goal, we first investigate the learning of a series of features with different dimensions using a few tens to thousands of codes via our improved CNN frameworks. Then, a Principal Component Analysis (PCA) is introduced to compress the high-dimensional remote sensing image feature codes learned by traditional CNNs. Comprehensive comparisons are conducted to evaluate the retrieval performance based on feature codes of different dimensions learned by the improved CNNs as well as the PCA compression. To further demonstrate the powerful ability of the low-dimensional feature representation learned by the improved CNN frameworks, a Feature Weighted Map (FWM), which can perform feature visualization and provides a better understanding of the nature of Deep Convolutional Neural Networks (DCNNs) frameworks, is explored. All the CNN models are trained from scratch using a large-scale and high-resolution remote sensing image archive, which will be published and made available to the public. The experimental results show that our method outperforms state-of-the-art CNN frameworks in terms of accuracy and storage.


Introduction
With the rapid development of Earth observation technology, remote imaging sensors with high spatial resolution have led to rapid increases in the volume of acquired remote sensing images. However, the effective management and retrieval of large scale remote sensing image databases represent considerable challenges that must be resolved. As a result, Content-Based High-Resolution Remote Sensing Imagery Retrieval (CB-HRRS-IR), which aims to search for and return the most relevant or similar images using a query image, has drawn increasing attention in recent years [1].
Currently, there are two essential modules that serve as solutions to CB-HRRS-IR: the feature representation module and the feature searching module [2]. Specifically, a feature vector is extracted to describe the visual content of an image in the feature representation module. Based on the extracted features, similarities between a query image and other images from the image database are calculated; then, the system returns the most similar images by ranking similarities. Both modules play important roles in an image retrieval system. Obviously, the length of the image features and the method of similarity measurement have a significant impact on the search efficiency, especially for enormous image archives in which the extracted features can largely influence the retrieval performance because of the ability of the features to represent the images.
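The two modules can be illustrated with a minimal sketch: given feature vectors that have already been extracted, the searching module ranks the database by distance to the query. The function names and the toy database below are hypothetical and serve only as an illustration; the Manhattan distance is used because it is the measure adopted later in this paper.

```python
from typing import Dict, List, Sequence


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_database(query: Sequence[float],
                  database: Dict[str, Sequence[float]]) -> List[str]:
    """Return image names ordered from most to least similar to the query."""
    return sorted(database, key=lambda name: manhattan(query, database[name]))


# Toy database of 2-dimensional features (illustrative values only).
toy_db = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
ranking = rank_database([1.0, 1.0], toy_db)
```

The cost of each comparison grows linearly with the feature length, which is why shorter codes directly reduce the search time for large archives.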
To achieve a satisfactory performance for CB-HRRS-IR, this paper focuses on extracting powerful features for better remote sensing image representation. Because a high-resolution remote sensing image contains abundant information with a large image size, high-dimensional feature vectors with hundreds or even thousands of codes are usually employed for the image representation [2,3]. However, when similarities between images are measured, especially in a large image database, high-dimensional features greatly increase the computational cost. Thus, long feature codes with high dimensions for image representation will have a notable negative impact on the retrieval efficiency. In this paper, we propose and analyze Deep Compact Codes (DCCs) with low dimensions for remote sensing image representations to advance the image retrieval efficiency. Specifically, several schemes are experimentally implemented to learn the DCCs via Deep Convolutional Neural Networks (DCNNs) for remote sensing image representations. Additionally, the CNNs are trained from scratch using a large, high-resolution remote sensing image archive that we collected. This archive will be made publicly available to other researchers. Our learned DCCs provide a better representation for remote sensing image retrieval. This representation can substantially improve the image retrieval performance with respect to both precision and efficiency. In addition, we explore a new method called the Feature Weighted Map (FWM) to assist in the visual understanding of deep features. The FWM can facilitate the process of determining the mechanisms underlying the efficacy of the proposed DCC method. Furthermore, our proposed visualization method can also provide insights into the information that can be learned from the DCNN frameworks.
The remainder of this paper is organized as follows. In Section 2, we introduce the background and review the most relevant studies on remote sensing image retrieval. In Section 3, we introduce the image feature extraction methods, including the DCC learning schemes and PCA compression, and describe the evaluation methods from both quantitative and visual perspectives. In Section 4, we introduce the large remote sensing image archive, which will be released publicly to other researchers, and provide the experimental and analysis results. In Section 5, we describe the FWM. Finally, in Section 6, we draw conclusions regarding this work.

Background and Related Studies
Classical image features, such as spectral features [4,5], texture features [6][7][8], shape features [9,10] and morphological features [7], are the most common features used for remote sensing image representation. Great success with regard to high-resolution remote sensing image retrieval has been achieved using these global features. Compared with global features, local features are also good at representing remote sensing images with high spatial resolution, and they allow for the recognition of a greater range of objects and spatial patterns in small patches. Yang et al. [1] conducted the first investigation of the use of local invariant features for overhead image retrieval. Extensive experiments showed the effectiveness and practicability of local features for high-resolution aerial imagery retrieval. Yang et al. [11] proposed a method that represents images using local features based on the typical Bag-of-Words (BoW) framework, which can improve the recognition performance of the remote sensing image retrieval process and reduce the burden of building the image index. Rather than extracting either traditional global or local features, Wang et al. [12] and Bosilj et al. [13] explored new methods that can utilize both global and local features for remote sensing image retrieval. Additionally, recent studies have proposed methods that account for structure information in the image representations [14][15][16].
Although accurate features can be extracted via various methods for remote sensing image retrieval, they cannot be easily employed to describe a user's understanding of an image, which is significant to a user's intention. In other words, the semantic contents of an image cannot be well revealed by these features. To alleviate this issue, Liu et al. [17] proposed a region-level semantic mining approach for image presentation and constructed a uniform region-based depiction for each image by segmenting the images by region. Then, the semantic features were extracted using a probabilistic method, which had good retrieval precision and recall. Wang et al. [18] proposed a remote sensing image retrieval scheme using image scene semantic matching. In addition, a prototype system that uses a coarse-to-fine retrieval scheme was implemented, and it had good retrieval accuracy. Recently, Linda et al. [19] presented a novel semantic mining and hashing method for remote sensing image retrieval, which showed good performance in their implemented retrieval system.
Most of the features mentioned in the above studies were low-level features that were individually designed, and they have been employed with a certain degree of success in remote sensing image retrieval. However, designing a stable and powerful feature representation for images could be a difficult task. In addition, remote sensing images with high resolution usually represent large geospatial scenes that contain abundant and complex visual contents. These factors can reduce the ability of low-level features to represent high-level abstract concepts in remote sensing images, which is known as the semantic gap between low-level features and high-level semantic content.
Recently, deep learning was shown to achieve considerable success in many tasks, including speech recognition [20,21], object recognition and detection [22][23][24][25] and natural language processing [26,27]. Inspired by such great success, high-level features extracted via deep learning have been introduced in the application of content-based high-resolution remote sensing image retrieval. Zhou et al. [28] utilized an unsupervised feature learning framework based on auto-encoder to map low-level feature descriptors to sparse feature representations for remote sensing image retrieval. Li et al. [2] also employed unsupervised multilayer feature learning and collaborative affinity metric fusion for remote sensing image retrieval. These methods can offer a higher-level feature representation of remote sensing images and outperform conventional features. However, the improvements are limited, and the unsupervised feature learning framework might not provide results that can be generalized because these frameworks are based on shallow networks that increase the difficulty of learning sufficiently powerful feature representations of remote sensing images. Moreover, the features learned by an unsupervised framework might require longer codes for image representation to achieve satisfactory retrieval results. This approach will obviously reduce the image retrieval efficiency. Napoletano [29] conducted an extensive evaluation of visual descriptors for the content-based retrieval of Remote Sensing (RS) images, including global, local, and CNN features. The results demonstrated that CNN-based and local features have the best performance in different retrieval schemes. Zhou et al. [30] investigated the extraction of features from both fully-connected and convolutional layers for remote sensing image retrieval. They [29,30] employed only CNN models and performed fine-tuning on a public remote sensing image dataset for feature extraction. 
Intensive comparisons were conducted to evaluate the performances of different models. Although these methods have achieved good performance in certain domains via DCNN frameworks, the learned deep features have not been sufficiently evaluated or described.
Visualization studies [23,31,32,33] have been conducted to better understand these deep features. Zeiler [23] developed deconvolutional networks to provide insights into the functions of intermediate feature layers and the operation of the classifier. However, deconvolutional networks do not always work well without max-pooling layers. Based on Zeiler's work, Springenberg [31] explored a guided backpropagation method that results in qualitative improvements. Zhou [34] developed a visualization method called class activation mapping that can localize objects. However, these methods only focus on the convolutional layers and ignore the fully connected layers, which play significant roles in feature representation. Dosovitskiy [32] and Mahendran [33] developed approaches to studying image representations by inverting deep features at different layers; however, these approaches show only image information rather than the object-related information preserved in the final deep features. Selvaraju [35] used the class-specific gradient information flowing into the final convolutional layer of a CNN to produce a coarse localization map of an object. However, this method can generate only a class-oriented visualization map based on the final classification score and is not applicable to non-classification tasks.
In this paper, we investigate and evaluate the performance of remote sensing image retrieval from a dimensional perspective. We analyze a series of different dimensional features that we call DCCs, which are extracted by improved classical CNNs. In addition, we also perform a PCA to compress the high dimensional feature codes learned by the DCNNs, a strategy that is referred to as Deep Principal Component Analysis (DPCA) in the retrieval experiments. Furthermore, we explore a new method for visualizing deep features to provide a better understanding of our learned DCC features. Compared with the known archives [1,29,36], a high-resolution remote sensing image archive with a much larger scale is used to train the CNNs. The CNNs are trained from scratch to acquire a powerful representation of the remote sensing images and explore the performance of the DCCs in the task of CB-HRRS-IR.
The main contributions of this paper are as follows:
• We propose the extraction of DCCs for CB-HRRS-IR via two schemes. First, we extract a series of different dimensional DCCs that include a few tens to thousands of codes as the feature representation of remote sensing images. Second, PCA is introduced to compress the high-dimensional remote sensing image feature codes learned by traditional DCNNs. The lower-dimensional feature codes outperform the higher-dimensional ones, and the DCCs outperform the DPCA. In addition, we explore the FWM visualization method for deep features learned by DCNN frameworks, which can help us to better understand the differences between DCC features and the original deep features.
• Compared with the fine-tuning methods of former studies, we train all the DCNNs from scratch to explore the performance of the DCCs in CB-HRRS-IR based on a large-scale remote sensing image archive. In addition, this large-scale high-resolution remote sensing image archive will be made publicly available to other researchers. We expect that the archive can serve as a standardized public dataset in this field and can help to advance the research in the remote sensing field.

Methods
In this section, we first review the off-the-shelf DCNN frameworks and then introduce the DCC feature extraction schemes evaluated in our work. Next, we present the evaluation protocols for the experimental results. Finally, we introduce our proposed visualization approach, which is designed to provide a better understanding of deep features.

DCNN Framework
A traditional DCNN framework usually consists of several different types of layers, including convolutional layers, pooling layers, and fully connected layers (Figure 1). In a convolutional layer, a certain number of convolutional kernels are used to generate feature maps from the previous layer. A pooling layer is applied to reduce the spatial dimensions of the feature map via an average or max pooling operation. One or several fully connected layers follow a convolutional layer or a pooling layer and constitute the final part of the DCNN framework. Note that the pixel values of a feature map and a fully connected layer are usually mapped via an activation function, such as Rectified Linear Units (ReLUs) [22], Leaky ReLU (LReLU) [37] and the improved Parametric ReLU (PReLU) [38]. These activation functions can improve a CNN framework's nonlinearity and effectively expedite the convergence of the training procedure and avoid overfitting; thus, they highly boost the framework's generalization capacity.
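The three activation functions named above differ only in how they treat negative inputs, which a short scalar sketch makes concrete. The slope values used here are illustrative assumptions: LReLU uses a small fixed slope, whereas in PReLU the slope is a parameter learned during training.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit: negative inputs are set to zero."""
    return x if x > 0 else 0.0


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Leaky ReLU: negative inputs are scaled by a small fixed slope (assumed 0.01)."""
    return x if x > 0 else slope * x


def prelu(x: float, a: float) -> float:
    """Parametric ReLU: same form as Leaky ReLU, but `a` is a learned parameter."""
    return x if x > 0 else a * x
```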

Feature Extraction Based on DCC
In general, a DCNN framework takes a raw image as input and processes it using a certain number of convolutional layers; then, it outputs feature maps of the original image in the final convolutional layer. The following fully connected layers then learn to convert the feature maps to a vector for image representation. In our opinion, the final fully connected layers can be regarded as an ordinary neural network for encoding the learned convolutional features. However, many studies have used high-dimensional codes (usually thousands of codes) in the fully connected layers for image representation for the tasks of object recognition, detection [22,23] and image retrieval [2,29,30]. Although these features are already rather compact compared with convolutional features, more compact features must be identified to further enhance the efficiency of the retrieval. Hinton [39] used neural networks for image data dimensionality reduction for the first time. Inspired by this work, we regard the fully connected layers in the DCNN framework to be ordinary neural networks that learn more compact codes (a few tens to thousands of codes) for image representation for the remote sensing image retrieval task.
In our experiments, two classical DCNN frameworks, i.e., Alexnet [22] and VGG-16 [40], are applied to compress the convolutional features. Alexnet includes five convolutional layers followed by three fully connected layers, while VGG-16 contains 13 convolutional layers followed by three fully connected layers. The fully connected layers usually learn the high-level abstract feature representation of an image. Additionally, image features are often extracted from the second fully connected layer with high dimensionality. In our view, the first fully connected layer (Fc1) encodes the learned feature map as a high-dimensional feature vector. The second fully connected layer (Fc2) can then be used to learn more compact feature codes with lower dimensionality from Fc1. Finally, the compact feature codes are fed into a classifier in the third layer (Fc3). Therefore, the Fc2 layer is important for learning DCCs with different dimensionalities and can be regarded as a DCC learning layer (Figure 1). To evaluate the performance of DCC features in the application of remote sensing image retrieval, we set the dimensions of the DCC learning layer to 4096, 1024, 256, 64, and 32. Then, we extract a series of different dimensional deep compact feature codes to further explore the retrieval performance.
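The effect of shrinking the DCC learning layer can be estimated with simple arithmetic. The sketch below counts the parameters of Fc2 and Fc3 and the storage per image code for each dimensionality; it assumes that Fc1 keeps its original width of 4096, that the classifier has 25 outputs (one per class of the dataset used in this paper), and that codes are stored as 32-bit floats. These are illustrative assumptions rather than reported settings.

```python
FC1_DIM = 4096        # assumed width of Fc1 (as in the original frameworks)
NUM_CLASSES = 25      # number of scene/object classes in the dataset
BYTES_PER_FLOAT = 4   # assumed 32-bit floating point storage


def fc_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def dcc_head_summary(dcc_dim: int) -> dict:
    """Parameter count of Fc2 and Fc3 and storage per code for one DCC size."""
    return {
        "fc2_params": fc_params(FC1_DIM, dcc_dim),
        "fc3_params": fc_params(dcc_dim, NUM_CLASSES),
        "bytes_per_code": dcc_dim * BYTES_PER_FLOAT,
    }


for dim in (4096, 1024, 256, 64, 32):
    print(dim, dcc_head_summary(dim))
```

Under these assumptions, a 64-dimensional code needs 1/64 of the storage of a 4096-dimensional code, and each distance computation touches 1/64 of the values.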

Feature Extraction Based on PCA
Although DCCs can be employed to replace the high-dimensional features extracted by the original DCNN frameworks, PCA, which is a classical data compression method with solid statistical foundations, is widely used in dimensionality reduction. To further evaluate the retrieval performance of the proposed DCC schemes, we also adopt PCA to compress the high-dimensional deep feature codes and compare the retrieval performance with that of the DCCs. Specifically, we have a set of $n$ features $\{f_1, f_2, \ldots, f_n\}$, $f_i \in \mathbb{R}^{D}$, which are extracted by a raw DCNN framework and form the columns of the feature matrix $F \in \mathbb{R}^{D \times n}$. Our goal is to acquire a compressed feature matrix $\hat{F} \in \mathbb{R}^{C \times n}$, where $C$ denotes the length of the compressed feature codes. The basic principle of the PCA used to achieve this goal is to compress $F$ via a projection operation, $\hat{F} = U^{T} F$, where $U \in \mathbb{R}^{D \times C}$ is the projecting matrix. With the features mean-centered, $U$ can be obtained using the following objective function:
$$\max_{U} \; \operatorname{tr}\left(U^{T} F F^{T} U\right) \quad \text{s.t.} \quad U^{T} U = I$$
The constraint $U^{T} U = I$ requires the projecting vectors to be orthogonal to one another in such a way that the compressed feature vectors are pairwise decorrelated. Similar to the DCC learning scheme, $f_i$ is a deep feature that is extracted from the penultimate fully connected layer of an original DCNN framework. We compress $f_i$ to yield shorter feature codes with the same dimensionality as that of the DCCs. In the following sections, we use DPCA to refer to the feature codes obtained by compressing the original deep features with the PCA method.
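The projection step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in the experiments: it stores features as rows rather than columns, and it finds the projecting vectors by power iteration with orthogonalization, a simple substitute for a library eigen-solver. All function names are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from every feature vector."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def covariance(centered):
    """D x D covariance matrix of centered feature rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def principal_axes(cov, c, iters=200, seed=0):
    """Top-c orthonormal eigenvectors of cov via power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    axes = []
    for _ in range(c):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            for u in axes:  # keep w orthogonal to the axes already found
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        axes.append(v)
    return axes


def compress(rows, c):
    """Project centered features onto the top-c axes (the columns of U)."""
    centered = center(rows)
    axes = principal_axes(covariance(centered), c)
    codes = [[sum(a * b for a, b in zip(r, u)) for u in axes] for r in centered]
    return axes, codes
```

The sketch assumes the data have at least `c` directions of non-zero variance; otherwise the normalization step would divide by zero.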

Quantitative Evaluation
For the similarity measurements, three distances that are the most commonly used for image retrieval are applied: Manhattan distance, Euclidean distance and cosine distance. For the sake of efficiency, we adopt the Manhattan distance to identify the images that are similar to the query. Regarding the retrieval performance, the Precision (P), Recall (R) and mean Average Precision (mAP) are often employed to assess the retrieval results. Precision is defined as the fraction of the retrieved images that are relevant to the query image, and recall is defined as the ratio of the number of retrieved relevant images to the total number of images that are relevant to the query image. Usually, only the top-k retrieved results are evaluated to determine their precision. The fraction of true relevant images in the top-k results (P@k) is calculated as follows:
$$P@k = \frac{1}{k} \sum_{i=1}^{k} \sigma(i)$$
where $\sigma(i)$ indicates the relevance between a query $q$ and the $i$-th ranked retrieved image. Here, $\sigma(i) \in \{0, 1\}$ is 1 if the $i$-th item is a relevant image and 0 otherwise. To assess the performance of the ranked retrieval results, an interpolated recall-precision curve can be plotted to compare the differences and determine the comprehensive performance of the retrieval schemes. Given a set of $Q$ queries, the mAP can be defined by calculating the Average Precision (AP) for all queries:
$$\mathrm{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(q)$$
where the AP for each query $q$ is defined as follows:
$$\mathrm{AP}(q) = \frac{\sum_{k=1}^{R} P@k \cdot \sigma(k)}{\sum_{k=1}^{R} \sigma(k)}$$
where $R$ represents the size of the test dataset.
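These measures can be sketched directly from a ranked list of relevance flags. The code below is a minimal illustration that uses the common definition of AP, in which the precision is averaged over the ranks at which relevant images appear; the function names are hypothetical, and each input list holds the 0/1 relevance of the ranked results for one query.

```python
from typing import List, Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of relevant images among the top-k ranked results."""
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of P@k over the ranks k at which a relevant image appears."""
    n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0
    hits = [precision_at_k(relevance, k)
            for k in range(1, len(relevance) + 1) if relevance[k - 1]]
    return sum(hits) / n_relevant


def mean_average_precision(all_relevance: List[Sequence[int]]) -> float:
    """Average of the per-query AP values over all queries."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)
```

For example, a ranking with relevance flags `[1, 0, 1, 0]` has P@2 = 0.5 and AP = (1 + 2/3) / 2 = 5/6.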

Visual Evaluation
The point of our explored visualization method is to extract image features from the penultimate fully connected layer and then to employ the backpropagation algorithm to map the feature codes back onto the convolutional feature layer, which yields the FWM. Specifically, for an image, the output of the fully connected layer is seen as a feature vector, and each code of the vector can be regarded as the importance of the corresponding dimensionality in the feature space. Therefore, we backpropagate the extracted feature as weights of the convolutional layers, and a weighted sum of the convolutional feature maps is used to generate the final FWM. Let $A_k(x, y)$ represent the activation of the $k$-th feature map of the last convolutional layer at position $(x, y)$. To obtain the importance of a feature code at the $d$-th dimension $f_d$, which is learned from the activation $A_k(x, y)$ of a feature map, we first calculate the gradient of $f_d$ with respect to $A_k(x, y)$:
$$g_k^d(x, y) = \frac{\partial f_d}{\partial A_k(x, y)}$$
Therefore, the whole contribution of $A_k(x, y)$ to the final feature can be calculated as follows:
$$G_k(x, y) = \sum_{d=1}^{D} g_k^d(x, y)$$
where $D$ is the dimensionality of the output feature. Thus, for every activation at position $(x, y)$ of a feature map, we can obtain its weight with respect to the final extracted feature using $G_k(x, y)$. However, the information contained in the final feature is not pure because of the image quality; thus, noise information is contained in the feature codes. As a result, this noise is also projected back to the weight of an activation $A_k(x, y)$ via the above operations. Considering this situation, we adopt the average weight of $A_k(x, y)$ as the final weight of the $k$-th feature map:
$$w_k = \frac{1}{n} \sum_{x} \sum_{y} G_k(x, y)$$
where $n$ is the number of pixels of the feature map. Usually, several feature maps correspond to the channels of the last convolutional layer.
To generate the FWM for visualization, the weighted sum of these feature maps is calculated as follows:
$$\mathrm{FWM}(x, y) = \sum_{k} w_k A_k(x, y)$$
Based on this method, we can visualize the information that is contained in the feature maps and preserved in the final feature codes. This approach helps us further understand the learned DCC feature codes and compare them to the traditional codes. Thus, this visualization method is feature oriented and can be generalized to any layer of a CNN framework to provide a representation of the nature of the CNN framework.
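The aggregation steps of the FWM can be sketched once the gradients are available. The code below is a minimal illustration under one reading of the description: the contribution G_k(x, y) is taken as the sum of the gradients over all D feature dimensions, the weight of each feature map is the spatial average of G_k, and the FWM is the weighted sum of the feature maps. The gradients are assumed to be supplied by a deep learning framework; the function name and the nested-list layout are hypothetical.

```python
def feature_weighted_map(activations, gradients):
    """Compute per-map weights and the FWM.

    activations: nested list indexed as [k][y][x] (last convolutional layer).
    gradients:   nested list indexed as [d][k][y][x], holding the gradient of
                 feature code d with respect to activation k at (x, y).
    """
    n_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    n_pixels = height * width

    weights = []
    for k in range(n_maps):
        # Sum gradients over feature dimensions and positions, then average
        # over the pixels of the feature map.
        total = sum(grad_d[k][y][x]
                    for grad_d in gradients
                    for y in range(height)
                    for x in range(width))
        weights.append(total / n_pixels)

    # Weighted sum of the feature maps at every position.
    fwm = [[sum(weights[k] * activations[k][y][x] for k in range(n_maps))
            for x in range(width)]
           for y in range(height)]
    return weights, fwm
```

Because the weights are derived from the extracted feature rather than from a class score, the same procedure applies to any layer whose output is used as a feature.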

Dataset
Extensive evaluations of retrieval performance were conducted using a large-scale high-resolution remote sensing image dataset composed of 25 classes of different scenes/objects. For each class, 500 images with a size of 256 by 256 pixels were manually collected, primarily from the World Map website (http://map.tianditu.com/map/index.html). The image classes are as follows: agricultural, airport, basketball court, bridge, building, container, fishpond, footbridge, forest, greenhouse, intersection, oiltank, overpass, parking lot, plane, playground, residential, river, ship, solar power area, square, tennis court, water, wharf, and workshop. Specifically, most of the images were collected from the 18-level remote sensing images found on the World Map, with a resolution of 0.6 m per pixel. However, an image that is 256 by 256 pixels might be too small to contain a large remote sensing scene, such as an airport or an overpass. As a compromise, we collected airport images with a 2.4 m per pixel resolution from the 16-level remote sensing images on World Map; these images can clearly reflect the properties of an airport. For the overpass, we collected corresponding images with a resolution of 1.2 m that contain the main parts of an overpass. Additionally, a small number of plane images were collected from Google Maps as a supplement; the resolution was the same as that for the images from World Map. All the collected images are in the Red-Green-Blue (RGB) color space. Four samples of each class are shown in Figure 2. As explained above, our dataset contains large-scale remote sensing images that vary in resolution, image size, and source. Each of these factors poses challenges to the comprehensive retrieval performance of our experiments. 
This remote sensing image dataset will be made publicly available to other researchers, and we expect that it will greatly promote research in the remote sensing community, including research related to remote sensing image classification, scene classification, object detection/recognition, and image retrieval.

Experimental Setup
As introduced in the previous section, we evaluated the retrieval effect using the popular Alexnet [22] and VGG-16 [40] CNNs, and we set the penultimate fully connected layer to be of various dimensionalities (4096, 1024, 256, 64 and 32) to obtain the DCCs. For the dataset, we randomly selected 250 images for each class from the whole dataset as a training dataset and then used the remainder as the test dataset. For the training dataset, we randomly selected 200 images of each class as training samples with the remaining 50 images of each class being used for the validation dataset. Note that we only selected less than half of the images to train the CNNs, which is different from [30], who adopted most of their images from the dataset for the CNNs training. Thus, a situation in which a small number of the samples is used for training can result in overfitting because of the large-scale parameters in the CNNs. To overcome this problem, we employed simple data augmentations to enrich the training dataset. Specifically, the form of the data augmentation consisted of the generation of image horizontal reflections and rotations with degrees of 90, 180 and 270. Compared with previous studies [2,29,30] that trained the CNN models by fine-tuning the pretrained CNN models based on natural images, we trained the CNNs from scratch using high-resolution remote sensing images to avoid the latent differences between the natural images and remote sensing images.
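The split sizes and the augmentation can be illustrated with a short sketch. The counts follow from the numbers given above (25 classes with 500 images each); the augmentation functions operate on an image stored as a nested list of rows. The clockwise rotation direction and the choice to apply each transform independently to the original image are assumptions made for illustration.

```python
NUM_CLASSES = 25
TRAIN_PER_CLASS, VAL_PER_CLASS, TEST_PER_CLASS = 200, 50, 250

split_sizes = {
    "train": NUM_CLASSES * TRAIN_PER_CLASS,
    "validation": NUM_CLASSES * VAL_PER_CLASS,
    "test": NUM_CLASSES * TEST_PER_CLASS,
}


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90(image):
    """Rotate the image by 90 degrees clockwise (assumed direction)."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the flipped image and the 90, 180 and 270 degree rotations."""
    r90 = rotate_90(image)
    r180 = rotate_90(r90)
    r270 = rotate_90(r180)
    return [horizontal_flip(image), r90, r180, r270]
```

With these sizes, 5000 images are used for training, 1250 for validation and 6250 for testing before augmentation.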
For the required input dimensionality, all images were resized to 227 by 227 pixels for Alexnet and 224 by 224 pixels for VGG-16, as well as their corresponding transformation networks. In addition, the mean values were extracted from the training samples. Following the previous works [22,23], the weights were set using a Gaussian distribution for Alexnet and "Xavier" for VGG-16. We set the learning rate to 0.01 for Alexnet and 0.001 for VGG-16. The batch size was set to 256 for Alexnet and 48 for VGG-16. For a fair comparison, the transformation CNNs were initialized in a manner similar to that of their corresponding original CNNs. All experiments were conducted using the Ubuntu-16.04 system and two Nvidia GTX Titan X GPUs with 12GB RAM.

Results and Analyses
In this section, we compare the performances of the DCC and DPCA methods for different dimensions. The top-100 retrieval precision results of each class for different methods are evaluated. The top-150 retrieval precision results for each class are also shown for further evaluation. For convenience, we use "ADCC#" to represent the names of the DCC frameworks: it denotes DCCs with a dimensionality of # learned using the "Alexnet" framework. The same naming is also applied to the corresponding VGG frameworks ("VDCC#"). Note that the length of the features extracted from the original Alexnet and VGG frameworks is 4096. Similarly, we use "APCA#" to represent the names of the DPCA learning methods: it denotes image feature codes learned by the original "Alexnet" framework and then compressed to a dimensionality of # via the PCA method. A similar naming is also used for the corresponding VGG frameworks.

Retrieval Performance of DCCs
In this section, the image retrieval performance of the DCC method is evaluated. The top-100 retrieval precision results for each class are shown in Table 1. In Table 1, there are two results reported in bold in each row. The first result is the best retrieval result for Alexnet and the corresponding DCC models; the second one is the best retrieval result for VGG and the corresponding models. As shown in Table 1, most of the best results for Alexnet and its DCC models are obtained using ADCC64 and ADCC32, and all the best results are obtained using our DCC models. Specifically, the findings indicate that precision is greatly increased by our DCC models for all classes compared with the results of the original Alexnet model. The river class in the ADCC64 model shows the largest increase in precision, with a significant improvement of nearly 20%. Moreover, nearly all our DCC models achieve obvious improvements in the retrieval results for each class compared with the results of the original Alexnet model. For VGG and its corresponding DCC models, most of the best results are achieved using VDCC64, and there is also a substantial improvement in the retrieval precision for each class compared with the original VGG model. Although several of the best retrieval results are not obtained using VDCC64, the precision achieved using VDCC64 is comparable with the best results. Additionally, the best improvement is observed for the wharf class, for which the precision increases from 51.02% to 75.32%. Thus, the precision of the VDCC64 model is more than 24% higher than that of the original VGG model. Compared with the ADCC32 model, which achieved several of the best retrieval results, the VDCC32 model does not generate the best retrieval results for any class, and certain classes appear to show decreases in retrieval precision. 
Nevertheless, the VDCC32 model achieves prominent improvements in the retrieval precision for 60% of the classes and certain improvements in retrieval precision are noticeable compared with the results of the original VGG model, such as for the wharf and building classes, which showed precision improvements of 18.66% and 14.95%, respectively. Table 2 shows the top-150 retrieval precision results for all classes. The retrieval performance based on the P@150 values shows a similar trend as that based on the P@100 values. The greatest improvements in precision are observed for the river class using the ADCC64 model and the wharf class using the VDCC64 model. Taken together, all our DCC models can generate significant improvements in the retrieval precision compared with their corresponding original frameworks.
For Alexnet and its DCC models, several of the best results are obtained using ADCC32, and the corresponding results obtained using ADCC64 are similar to those of ADCC32. For the VDCC32 model, a sharp decline in the precision is observed compared with that of the VDCC64 model. These results indicate that image features with a dimensionality of 64 may be optimal for the image representation. To comprehensively assess the performances of different models, we compare the model accuracies of these methods, which are calculated as the mean of the retrieval precisions of all classes achieved by the CNN model. The model accuracies based on the P@100 and P@150 values are listed in Tables 3 and 4, respectively. Additionally, the best results are reported in bold. Tables 3 and 4 show that the model accuracy clearly increases as the feature dimensionality is reduced from 4096 to 64. ADCC64 and VDCC64 achieve the best results at both the P@100 and P@150 levels, and remarkable accuracy improvements can be observed. Specifically, at the P@100 level, ADCC64 and VDCC64 show accuracy improvements of 8.81% and 8.15% compared with the results of the corresponding original CNN frameworks, respectively. At the P@150 level, ADCC64 and VDCC64 show accuracy improvements of 9.37% and 9.06% compared with the results of the baseline CNN frameworks, respectively. These findings are in accordance with the class precision results shown in Tables 1 and 2. In Tables 3 and 4, each row shows approximate accuracy improvements as the dimensions of the DCCs decrease from 4096 to 1024, 256 and 64. Thus, the efficacy of our proposed DCC method is demonstrated by the performances of different types of CNN frameworks. Regarding the 32-dimensional feature codes, both ADCC32 and VDCC32 show decreases in model accuracy compared with the corresponding 64-dimensional feature codes, especially VDCC32. 
This finding indicates that 64-dimensional feature codes are more effective for image representation than feature codes with other dimensions, which confirms our former hypothesis. Moreover, even when decreases in accuracy are observed for the 32-dimensional features compared with 64-dimensional features, the ADCC32 and VDCC32 models still achieve significantly higher accuracies than those of the Alexnet and VGG models, respectively.
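The two quantities used above, the top-k retrieval precision (P@100, P@150) and the model accuracy defined as the mean of the per-class precisions, can be illustrated with a minimal Python sketch. The function names and the toy labels are hypothetical and are not taken from the paper's implementation; the sketch only assumes that a retrieved image counts as relevant when it shares the query's class label.

```python
import numpy as np

def precision_at_k(ranked_labels, query_label, k):
    """Fraction of the top-k retrieved images that share the query's class."""
    top_k = np.asarray(ranked_labels[:k])
    return float(np.mean(top_k == query_label))

def model_accuracy(per_query_precision, query_labels):
    """Mean over classes of the mean P@k of the queries belonging to each class."""
    per_query_precision = np.asarray(per_query_precision, dtype=float)
    query_labels = np.asarray(query_labels)
    class_means = [per_query_precision[query_labels == c].mean()
                   for c in np.unique(query_labels)]
    return float(np.mean(class_means))

# Toy example (hypothetical labels): 3 of the top-4 results match the query class.
p_at_4 = precision_at_k(["wharf", "wharf", "river", "wharf"], "wharf", 4)

# Class "a" averages 0.75 and class "b" averages 0.20, so the model accuracy
# is the unweighted mean of the two class-level values.
acc = model_accuracy([0.5, 1.0, 0.2], ["a", "a", "b"])
```

Averaging per class first, rather than over all queries at once, keeps classes with many query images from dominating the reported accuracy.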
Specifically, a comparison of feature codes of the same dimensionality from different frameworks shows that VGG and its corresponding DCC models achieve higher accuracies than those of Alexnet and its corresponding DCC models. This finding makes sense because VGG and its corresponding DCC models are deep CNN frameworks, while Alexnet and its corresponding DCC models are shallow CNN frameworks. Nevertheless, a comparison of the results of the DCC models of Alexnet with those of the original VGG model shows that the accuracies of the DCC models are considerably higher than those of the VGG model, especially for the ADCC64 and ADCC32 models, which present accuracy improvements of 6.89% and 6.30%, respectively. These findings demonstrate that the low-dimensional feature codes learned by our DCC models have more powerful image representation abilities than the high-dimensional feature codes. In other words, the high-dimensional features learned by traditional CNN frameworks do not always have the best image representation abilities, whereas the lower-dimensional features learned by our DCC method can greatly improve upon the performance of the original frameworks. More importantly, these results indicate that our proposed DCC method can achieve better performance with a shallow CNN framework than the original deep CNN framework achieves. The advantage of this approach is clear: the proposed DCC method is easy to use, and it can also reduce the training time requirements and make applications more convenient. This discovery can also be employed for many other tasks, such as image classification and object detection, to further improve performance and efficiency, because the feature representations are both shorter and more powerful.
For further evaluation, we employ confusion matrices to show the classification results for the 64- and 4096-dimensional features extracted from the different models, as shown in Tables 5-8.
For simplicity, let C1, C2, ..., C25 represent agricultural, airport, basketball court, bridge, building, container, fishpond, footbridge, forest, greenhouse, intersection, oiltank, overpass, parking lot, plane, playground, residential, river, ship, solar power area, square, tennis court, water, wharf and workshop, respectively. Each column in Tables 5-8 corresponds to the prediction result, while each row represents the actual class. Note that the last column and the last row correspond to the classification precision and the prediction precision, respectively. In addition, the overall accuracy is reported in bold in the bottom-right cell. As seen from Tables 5 and 6, even though there are slight decreases in classification precision for several classes, most of the classes experienced improvements in classification precision when the ADCC64 model was used. In addition, the precision improvements are larger than the decreases; for example, river (C18), square (C21) and tennis court (C22) achieved 9.60%, 6.80% and 7.60% higher classification precision, respectively, when our DCC method was used. The prediction precision of each class shows a similar trend. The intersection class achieved a 10.70% improvement in prediction precision when the ADCC64 model was used. For the VGG and VDCC64 models (Tables 7 and 8), most classes experienced precision improvements in both classification and prediction when our DCC method was used. However, the playground (C16) class suffered a 5.60% decrease in classification precision when the VDCC64 model was used. This result occurred because some playground samples are more easily classified as basketball court (C3) samples, as shown in Table 8. In addition, some tennis court (C22) samples are also wrongly classified as basketball court samples. This led to a 3.78% decrease in the prediction precision of the basketball court class.
The main reason for this phenomenon is that the backgrounds of the tennis court, playground and basketball court samples are similar to some extent. The VDCC64 model is more capable of recognizing the basketball court class, as the classification precision for the basketball court class using this model improved by 10.80%. This result implies that the VDCC64 model can learn to discriminate the basketball court class from other classes. It also reveals that VDCC64 has some deficiencies in discriminating similar classes. Nevertheless, our DCC method can achieve a higher overall classification accuracy. For the ADCC64 model, the overall classification accuracy is 86.56%, which is nearly 2.00% higher than that of the original Alexnet model (84.58%). In addition, the VDCC64 model achieves an approximately 1.50% improvement in overall classification accuracy (88.37%) compared with the original VGG model (86.88%). In general, our DCC models can achieve better classification results for most classes.

Table 6. Confusion matrix of classification results based on ADCC64.
Class  C1  C2  C3  C4  C5  C6  C7  C8  C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23 C24 C25   P(%)
C1    204   4   2   0   0   0   5   5   0   3   0   0   6   1   1   1   0   3   0   5   2   3   5   0   0  81.60
C2      4 204   2   0   1   1   0   0   0   2   3   2  13   1   5   0   0   3   0   1   6   2   0   0   0  81.60
C3      0   0 161   0   5   0   0   1   0   0   0   0   0   7   0  19   4   0   0   0   7  43   0   3   0

Table 7. Confusion matrix of classification results based on VGG.

Table 8. Confusion matrix of classification results based on VDCC64.
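As a worked illustration of how the last column (classification precision), the last row (prediction precision) and the bottom-right cell (overall accuracy) of Tables 5-8 relate to the matrix entries, the following Python sketch computes all three, assuming rows are actual classes and columns are predictions as stated above. The helper names and the 3-class matrix `toy` are hypothetical; `row_c1` is the C1 row of Table 6 as printed in the text.

```python
import numpy as np

def classification_precision(cm):
    """Per actual class (row): correctly classified samples / all samples of that class, in %."""
    cm = np.asarray(cm, dtype=float)
    return 100.0 * np.diag(cm) / cm.sum(axis=1)

def prediction_precision(cm):
    """Per predicted class (column): correct samples / all samples assigned to that class, in %."""
    cm = np.asarray(cm, dtype=float)
    return 100.0 * np.diag(cm) / cm.sum(axis=0)

def overall_accuracy(cm):
    """Correctly classified samples over all samples, in %."""
    cm = np.asarray(cm, dtype=float)
    return 100.0 * np.trace(cm) / cm.sum()

# Hypothetical 3-class confusion matrix, for illustration only.
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 1, 9]]
row_precision = classification_precision(toy)   # one value per actual class
col_precision = prediction_precision(toy)       # one value per predicted class
accuracy = overall_accuracy(toy)

# Row C1 of Table 6 (ADCC64): the diagonal entry is the first element of the row.
row_c1 = [204, 4, 2, 0, 0, 0, 5, 5, 0, 3, 0, 0, 6, 1, 1, 1, 0, 3, 0, 5, 2, 3, 5, 0, 0]
c1_precision = 100.0 * row_c1[0] / sum(row_c1)
```

The C1 row sums to 250 samples, of which 204 lie on the diagonal, which reproduces the 81.60% reported in the last column of that row. The sketch does not guard against a class that is never predicted (a zero column sum).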

Retrieval Performance of the DPCA
In this section, we show the retrieval performance achieved using the DPCA scheme described in Section 3. Tables 9 and 10 show the top-100 and top-150 retrieval precision results for each class, respectively. The results show that when we compress the original deep feature codes to dimensions of 1024, 256 and 64, only a few classes show improved retrieval precision. The 32-dimensional features show better retrieval performance for certain classes. The wharf class shows the greatest increase in precision when the original deep features are compressed via the PCA method. Specifically, at the P@100 level, an 8.45% precision improvement is achieved by compressing the deep features from the Alexnet model and a 17.57% precision improvement is achieved by compressing the deep features from the VGG model, whereas at the P@150 level, 8.75% and 18.33% precision improvements are achieved by compressing the deep features from the Alexnet and VGG models, respectively. However, Tables 9 and 10 show that the improvements in retrieval performance are limited compared with the results based on the original deep features, and many of the best retrieval results are still obtained using the original deep features. In other words, the retrieval performance decreases for certain classes when the feature codes are extracted via the PCA method. This finding indicates that certain important feature information can be lost while compressing the deep features to lower dimensionalities. A comparison with Tables 1 and 2 shows that our DCC method can learn shorter feature codes and achieve much better retrieval performance. Tables 11 and 12 show the comprehensive retrieval accuracies of the features of different dimensionalities for the entire dataset. As shown in Tables 11 and 12, sharp decreases in the comprehensive retrieval performance are observed when the original deep features are compressed to lower dimensionalities.
Note that the loss in precision shrinks as the dimensionality is reduced. When deep features are compressed to a dimensionality of 32, retrieval precision improvements occur; this finding corresponds to the information presented in Tables 9 and 10. Based on this observation, we also calculate the retrieval accuracies achieved when the deep features are compressed to a dimensionality of 16. Specifically, at the P@100 level, 16-dimensional features compressed from the deep features of the Alexnet framework achieve a retrieval precision of 69.05%, which is 0.30% lower than the precision of the original deep features. With regard to the deep features of the VGG framework, the compressed 16-dimensional features achieve a retrieval precision of 73.01%, which is 1.73% higher than the precision of the original deep features. The P@150 level generally shows the same results. Nevertheless, all these results show reduced performance compared with that of the compressed 32-dimensional features, as shown in Tables 11 and 12. Whereas our DCC method achieves significant improvements in retrieval performance at all the lower dimensionalities, the DPCA method achieves only limited improvements, and only for the 32- and 16-dimensional features. This phenomenon mainly occurs because the PCA method is a linear compression scheme, while the features of remote sensing images are likely to be related non-linearly. This finding indicates that our proposed DCC method is an efficient method for learning more powerful image feature representations with lower dimensionalities. Figure 3 shows a query image of a wharf and the corresponding top-5 irrelevant retrieval images obtained using the Alexnet-based DCC and DPCA methods.
The results show that the rank positions of the irrelevant images obtained using the ADCC32, ADCC64 and ADCC256 models are much larger than those obtained using the other models at the same position; that is, the first irrelevant images appear much later in the ranked list, which indicates that our proposed DCC method has a better retrieval performance. Specifically, ADCC64 shows the best results, which is consistent with the results shown in Tables 1-4 and prior analyses.
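The DPCA idea evaluated in this section, linearly projecting the high-dimensional deep features onto a few principal components and retrieving with the compressed codes, can be sketched in Python as follows. This is only an illustration under stated assumptions: the random array stands in for real 4096-dimensional CNN features, the function names are hypothetical, and the Euclidean distance used for ranking is an assumption rather than the paper's confirmed similarity measure.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return the feature mean and the top principal directions (one per row)."""
    mean = features.mean(axis=0)
    # SVD of the centered data: the rows of vt are principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(features, mean, components):
    """Project features onto the principal directions (a purely linear map)."""
    return (features - mean) @ components.T

def retrieve(query_code, database_codes, top_k):
    """Indices of the top_k database codes closest to the query (Euclidean, assumed)."""
    dists = np.linalg.norm(database_codes - query_code, axis=1)
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(0)
deep_features = rng.normal(size=(200, 4096))   # stand-in for 4096-d deep features
mean, components = fit_pca(deep_features, 32)  # compress 4096 -> 32 dimensions
codes = compress(deep_features, mean, components)
ranking = retrieve(codes[0], codes, top_k=5)   # the query itself ranks first
```

Because `compress` is a single matrix product, it can only capture linear structure in the features, which is the limitation the text attributes to the PCA scheme. Note that `n_components` cannot exceed the smaller of the number of samples and the feature dimensionality.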

Comparison
In this section, we compare the performances of the DCC and DPCA methods. The mAP results for the evaluated methods are listed in Table 13, which shows that all DCC frameworks substantially outperform the baseline CNN frameworks. Compared with the baseline frameworks, the 64-dimensional features extracted by our proposed DCC method based on the Alexnet or VGG frameworks achieve the best results. Specifically, absolute mAP increases of 8.51% and 8.64% are observed for the 64-dimensional features using the Alexnet- and VGG-based DCC frameworks, respectively. For the DPCA method, only the 32-dimensional features compressed from the deep features of the VGG framework (VPCA32) achieve a better mAP than the baseline framework and thus improve the image retrieval performance. In contrast, the 32-dimensional features extracted by our proposed DCC method substantially outperform those of the DPCA method. Furthermore, the mAP values of all DCC frameworks are greater than those of the DPCA method, as shown in Table 13. The recall-precision curves of the different DCC and DPCA methods are plotted in Figure 4. As shown, low-dimensional features obtained using our proposed DCC method outperform those of the baseline frameworks (Figure 4a,b), which is also consistent with Tables 3 and 13. Specifically, for Alexnet-based frameworks, 64-dimensional features have the best performance, with high recall and precision retrieval results (Figure 4a). Note that the 32-dimensional DCC features based on Alexnet achieve much better results at lower recall levels than the corresponding baseline features, which is very desirable for precision-oriented image retrieval. For high recall levels, the 256-dimensional features have the best performance. Therefore, this dimensionality is suitable for recall-oriented image retrieval.
For VGG-based frameworks, the 64-dimensional features also outperform the features of all the other dimensionalities and show comparable results at high recall levels, even when compared with the 256-dimensional features (Figure 4b). For the DPCA method, only the 32-dimensional features show results comparable to those of the original deep features (Figure 4c,d). In general, the 64-dimensional features based on our proposed DCC frameworks show the best results, and they dramatically improve the retrieval performance. A further comparison based on the above analyses is performed by plotting the best results of each framework in Figure 5. The results reveal that the 64-dimensional features extracted by our DCC frameworks significantly outperform those of the baseline frameworks and the DPCA method, which demonstrates the effectiveness and practicability of our proposed DCC method for remote sensing image retrieval. Specifically, the findings indicate that the 64-dimensional DCC features based on the Alexnet and the VGG frameworks generally have the same performance. Note that Alexnet-based DCC models are shallow frameworks, whereas VGG-based models are much deeper. This comparison shows that our proposed DCC method can use a shallow CNN framework to realize a performance comparable to that achieved using a much deeper CNN framework. Furthermore, the 64-dimensional DCC features based on the Alexnet framework achieve much better retrieval results than the original VGG framework, which is also consistent with the results shown in Tables 3 and 13. This finding reveals that our proposed DCC method can greatly improve upon the performance of a shallow CNN framework and can reach a greater precision than that achieved using a deeper CNN framework, which improves storage and efficiency simultaneously.
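For readers who want to reproduce measures of the kind reported in Table 13 and Figure 4, the following Python sketch computes average precision, mAP and the points of a recall-precision curve from binary relevance lists. It uses the common definition of average precision (the mean of the precision values at the ranks where relevant images occur), which is an assumption, since the exact variant used in the paper is not restated in this section; the function names and toy lists are hypothetical.

```python
import numpy as np

def average_precision(relevance):
    """AP of one ranked list; relevance[i] is 1 if the i-th result is relevant, else 0."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_rank = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    # Average the precision only over the ranks that hold a relevant image.
    return float((precision_at_rank * relevance).sum() / relevance.sum())

def mean_average_precision(relevance_lists):
    """mAP: the mean of the per-query average precisions."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

def precision_recall_points(relevance, n_relevant):
    """Precision and recall after each rank, for plotting a recall-precision curve."""
    relevance = np.asarray(relevance, dtype=float)
    hits = np.cumsum(relevance)
    precision = hits / np.arange(1, len(relevance) + 1)
    recall = hits / n_relevant
    return precision, recall

# Toy ranked lists for two queries (hypothetical).
ap = average_precision([1, 0, 1, 1])                    # (1 + 2/3 + 3/4) / 3
m_ap = mean_average_precision([[1, 0, 1, 1], [0, 1]])
precision, recall = precision_recall_points([1, 0, 1, 1], n_relevant=3)
```

Here the average precision is normalized by the number of relevant images that appear in the ranked list, which equals the total number of relevant images only when the whole database is ranked.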

Visual Understanding of the DCC
The FWM can help us better understand the nature of DCC features via visualization. Based on previous results and analyses, the ADCC64 and VDCC64 models show the best retrieval performances, which demonstrates that the 64-dimensional DCC features can represent an image more appropriately than the 4096-dimensional features extracted from traditional DCNN frameworks. Therefore, the ADCC64 and VDCC64 models are selected for the weighted feature visualization. We also show the FWM based on the traditional Alexnet and VGG DCNN frameworks for comparison. Specifically, 10 samples of 25 classes from the dataset are randomly selected for visualization. The results of the FWMs are shown in Figure 6, which indicates that our proposed FWM method can successfully illustrate the regions of objects in an image. The red regions tend to overlap the objects in the images, which demonstrates the effectiveness of our proposed FWM visualization method. At the same time, the red regions in the FWMs indicate the features on which the DCNN frameworks focus; this information is preserved in the final deep features. This helps explain why the learned deep features provide accurate representations of the images.
A comparison of the FWMs based on the ADCC64 and VDCC64 models with those based on the baseline CNN frameworks shows that the former indicate the location and extent of objects in the images more precisely for certain classes, such as playground, parking lot and tennis court. In contrast, in the FWMs based on the original Alexnet and VGG frameworks, more of the information is scattered in a disorderly manner, which could represent noisy information about the objects. This information can also be learned and preserved in the final feature codes; thus, it may have a negative influence on the retrieval performance. This observation helps us intuitively understand how our proposed DCC frameworks can significantly outperform the traditional Alexnet and VGG frameworks for these classes, and it is also consistent with the retrieval precision results shown in Table 1. Although noisy information is shown in the FWM of the footbridge class, our proposed DCC methods tend to focus more precisely on the position and region of the footbridge. For the building class, the DCC frameworks tend to learn more aggregated information from the building top, which is very different from the scattered information found by the original frameworks, especially the VGG model. Correspondingly, the VDCC64 model remarkably improves the retrieval performance of the building class by 17.10% compared with the original VGG model. For the plane class, the DCC method can learn more information from different parts of a plane. Regarding the FWM of the oil tank, both the DCC and traditional frameworks can distinguish all the oil tanks in the image. Note that the VDCC64 model has better results compared with the VGG model. For the airport and bridge classes, all CNN frameworks tend to learn the background information of the objects rather than the airport runways or bridge bodies. This tendency means that the CNN frameworks attempt to discern the objects from their backgrounds.
The DCNN frameworks do not always learn object information, which is similar to human cognition. Additionally, the background information is vital for learning class-discriminative information. Based on these observations, the proposed FWM visualization method based on our DCC frameworks should also be applicable to class discrimination, scene recognition and other non-classification tasks.

Figure 6. Feature Weighted Maps of the DCC models. The first column of each row is the original image. The second to fifth columns are the FWMs extracted by the ADCC64, Alexnet, VDCC64 and VGG models, respectively.

Conclusions
In this work, we propose learning DCCs for CB-HRRS-IR. Extensive experiments are conducted to learn feature codes with dimensionalities ranging from tens to thousands for image retrieval based on a large-scale remote sensing image archive. A PCA is also employed to compress the deep features. The experimental results reveal that our proposed DCC method can remarkably outperform traditional DCNN frameworks and the DPCA method. Additionally, the 64-dimensional DCC features yield the best retrieval results. To further understand the learned deep features, we explore a feature-oriented visualization method, the FWM, which demonstrates that our proposed DCC method can learn more powerful information for image representation. This feature-oriented visualization method can also be generalized to any CNN framework and can generate FWMs from any layer for a classification-oriented or non-classification task.