Remote Sens. 2017, 9(5), 489; https://doi.org/10.3390/rs9050489
Learning Low Dimensional Convolutional Neural Networks for High-Resolution Remote Sensing Image Retrieval
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
Electrical Engineering and Computer Science, University of California, Merced, CA 95343, USA
Author to whom correspondence should be addressed.
Academic Editors: Gonzalo Pajares Martinsanz and Prasad S. Thenkabail
Received: 16 April 2017 / Accepted: 14 May 2017 / Published: 17 May 2017
Learning powerful feature representations for image retrieval has always been a challenging task in the field of remote sensing. Traditional methods focus on extracting low-level hand-crafted features which are not only time-consuming but also tend to achieve unsatisfactory performance due to the complexity of remote sensing images. In this paper, we investigate how to extract deep feature representations based on convolutional neural networks (CNNs) for high-resolution remote sensing image retrieval (HRRSIR). To this end, several effective schemes are proposed to generate powerful feature representations for HRRSIR. In the first scheme, a CNN pre-trained on a different problem is treated as a feature extractor since there are no sufficiently-sized remote sensing datasets to train a CNN from scratch. In the second scheme, we investigate learning features that are specific to our problem by first fine-tuning the pre-trained CNN on a remote sensing dataset and then proposing a novel CNN architecture based on convolutional layers and a three-layer perceptron. The novel CNN has fewer parameters than the pre-trained and fine-tuned CNNs and can learn low dimensional features from limited labelled images. The schemes are evaluated on several challenging, publicly available datasets. The results indicate that the proposed schemes, particularly the novel CNN, achieve state-of-the-art performance.
Keywords: image retrieval; deep feature representation; convolutional neural networks; transfer learning; multi-layer perceptron
1. Introduction
With the rapid development of remote sensing sensors over the past few decades, a considerable volume of high-resolution remote sensing images is now available. The high spatial resolution of the images makes detailed image interpretation possible, enabling a variety of remote sensing applications. However, efficiently organizing and managing this huge volume of remote sensing data remains a great challenge in the remote sensing community.
High-resolution remote sensing image retrieval (HRRSIR), which aims to retrieve and return images of interest from a large database, is an effective and indispensable method for the management of the large amount of remote sensing data. An integrated HRRSIR system roughly includes two components, feature extraction and similarity measure, and both play an important role in a successful system. Feature extraction focuses on the generation of powerful feature representations for the images, while similarity measure focuses on feature matching to determine the similarity between the query image and other images in the database.
We focus on feature extraction in this work since retrieval performance largely depends on whether the extracted features are representative. Traditional HRRSIR methods are mainly based on low-level feature representations, such as global features including spectral features, shape features, and especially texture features [3,4,5], which have been shown to achieve satisfactory performance. In contrast to global features, local features are extracted from image patches centered at interesting or informative points and thus have desirable properties such as locality, invariance, and robustness. Remote sensing image analysis has benefited greatly from these properties, and many methods have been developed for remote sensing registration and detection tasks [6,7,8]. In addition to these tasks, local features have also proven to be effective for HRRSIR. Yang et al. were the first to investigate local invariant features for content-based geographic image retrieval. Extensive experiments on a publicly available dataset indicate the superiority of local features over global features such as simple statistics, color histogram, and homogeneous texture. However, both the global and local features mentioned above are low-level features. Moreover, they are hand-crafted, which likely limits their effectiveness as feature representations.
Recently, deep learning methods have dramatically improved the state-of-the-art in speech recognition as well as object recognition and detection . Content-based image retrieval has also benefited from the success of deep learning . Researchers have explored the application of unsupervised deep learning methods for remote sensing recognition tasks such as scene classification  and image retrieval [13,14]. Unsupervised feature learning  has been used to learn sparse feature representations from remote sensing images directly for HRRSIR, but the performance is only slightly better than state-of-the-art. This is because the unsupervised learning framework is based on a shallow auto-encoder network with a single hidden layer making it incapable of generating sufficiently powerful feature representations. Deeper networks are necessary to generate powerful feature representations for HRRSIR.
Convolutional neural networks (CNNs), which consist of convolutional, pooling, and fully-connected layers, have already been regarded as the most effective deep learning approach to image analysis due to their remarkable performance on benchmark datasets such as ImageNet . However, large numbers of labeled training samples as well as “tricks” to prevent overfitting are needed to train effective CNNs. In practice, a common strategy to address the training data issue is to transfer deep features from CNN models trained on problems for which there is enough labeled data, such as the ImageNet dataset for object recognition, and then apply them to the problem at hand such as scene classification [16,17,18] and image retrieval [19,20,21,22]. Alternately, in a recent work on transfer learning , instead of transferring the features from a pre-trained CNN, the authors use the scores obtained by the pre-trained deep neural networks to train a support vector machine (SVM) classifier. However, whether the deep feature representations extracted from such pre-trained CNN models can be used for HRRSIR remains an open question. The limited work on using pre-trained CNNs for remote sensing image retrieval  only considers features extracted from the last fully-connected layer. In contrast, we perform a thorough investigation on transfer learning for HRRSIR by considering features from both the fully-connected and convolutional layers, and from a wide range of CNN architectures. We further fine-tune the pre-trained CNNs to learn domain-specific features as well as propose a novel CNN architecture which has fewer parameters and learns low dimensional features from limited labelled images.
The main contributions of this paper are as follows:
- We propose two effective schemes to extract powerful deep features using CNNs for HRRSIR. In the first scheme, the pre-trained CNN is regarded as a feature extractor, and in the second scheme, a novel CNN architecture is proposed to learn low dimensional features.
- A thorough, comparative evaluation is conducted for a wide range of pre-trained CNNs using several remote sensing benchmark datasets. Three new challenging datasets are introduced to overcome the performance saturation of the existing benchmark dataset.
- The novel CNN is trained on a large remote sensing dataset and then applied to the other remote sensing datasets. The results show that replacing the fully-connected layers with a multi-layer perceptron not only decreases the number of parameters but also achieves remarkable performance with low dimensional features.
- The two schemes achieve state-of-the-art performance, establishing baseline results for future work on HRRSIR.
The remainder of the paper is organized as follows. An overview of CNNs, including the various pre-trained CNN models we consider, is presented in Section 2 and our specific HRRSIR methods are described in Section 3. Experimental results and analysis are presented in Section 4, and Section 5 includes a discussion. Section 6 concludes the findings of this study.
2. Convolutional Neural Networks (CNNs)
In this section, we first briefly introduce the typical architecture of CNNs and then review the pre-trained CNN models evaluated in our work.
2.1. The Architecture of CNNs
The main building blocks of a CNN architecture are layers of different types: convolutional layers, pooling layers, and fully-connected layers. Each convolutional layer generally has a fixed number of filters (also called kernels or weights) and outputs the same number of feature maps by sliding the filters over the feature maps of the previous layer. The pooling layers perform subsampling along the spatial dimensions of the feature maps to reduce their size via max or average pooling. The fully-connected layers follow the convolutional and pooling layers. Figure 1 shows the typical architecture of a CNN model. Note that the element-wise rectified linear unit (ReLU), i.e., $f(x) = \max(0, x)$, is generally applied to the feature maps of both the convolutional and fully-connected layers to generate non-negative features. ReLU is a commonly used activation function in CNN models because it has been shown to perform better than other activation functions [24,25].
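As a minimal illustration of two of the building blocks described above, the following Python sketch applies an element-wise ReLU and a non-overlapping 2 × 2 max pooling to a small feature map. The function names `relu` and `max_pool_2x2`, the pooling window size, and the toy values are illustrative assumptions and are not taken from any of the evaluated models.

```python
def relu(feature_map):
    """Element-wise rectified linear unit: f(x) = max(0, x)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2 x 2 max pooling (stride 2) over a 2-D feature map.

    Assumes the height and width of the map are even.
    """
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map (hypothetical values).
toy_map = [[-1.0, 2.0, 0.5, -3.0],
           [4.0, -2.0, -0.5, 1.0],
           [0.0, -1.0, -2.0, -4.0],
           [-3.0, 0.5, -1.0, -0.5]]
activated = relu(toy_map)        # negative responses become zero
pooled = max_pool_2x2(activated) # 4 x 4 map is reduced to 2 x 2
```

After the ReLU every entry is non-negative, and the pooling halves each spatial dimension while keeping the strongest response in each window.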
2.2. The Pre-Trained CNN Models
Several successful CNN models pre-trained on ImageNet are evaluated in our work, namely the famous baseline model AlexNet , the Caffe reference model (CaffeRef) , the VGG network , and the VGG-VD network .
AlexNet is regarded as a baseline model, as it achieved the best performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012). The success of AlexNet is attributed to the large-scale labelled dataset, and techniques such as data augmentation, ReLU activation function, and dropout to reduce overfitting. Dropout is usually used in the first two fully-connected layers to reduce overfitting by randomly setting the output of each hidden neuron to zero with probability 0.5 . AlexNet contains five convolutional layers followed by three fully-connected layers, providing guidance for the design and implementation of subsequent CNN models.
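The dropout step described above can be sketched as follows. This is only a training-time illustration under stated assumptions: the function name `dropout`, the seeded random generator, and the toy activations are hypothetical, and any rescaling of activations at test time is omitted.

```python
import random


def dropout(activations, drop_probability=0.5, seed=None):
    """Set each activation to zero independently with the given probability."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < drop_probability else a for a in activations]


# Hypothetical outputs of six hidden neurons.
hidden = [0.3, 1.2, 0.7, 2.5, 0.9, 1.1]
dropped = dropout(hidden, drop_probability=0.5, seed=42)
```

Each output is either zero or the unchanged input, so a different random subset of neurons contributes on every training pass.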
CaffeRef is a minor variation of AlexNet and is trained using the open-source deep learning framework Convolutional Architecture for Fast Feature Embedding (Caffe) . The modifications of CaffeRef lie in the order of pooling and normalization layers as well as data augmentation strategy. It achieved similar performance on ILSVRC-2012 to AlexNet.
The VGG network includes three CNN models: VGGF, VGGM, and VGGS. These models explore different accuracy/speed trade-offs on benchmark datasets for image recognition and object detection. They have similar architectures with variations in the number and sizes of filters in the convolutional layers. The width of the final hidden layer determines the dimension of the feature representation when these models are used as feature extractors. This is equal to 4096 dimensions for the default model. Three variant models, VGGM128, VGGM1024, and VGGM2048, with narrower final hidden layers are used to investigate the effect of feature dimension. In order to speed up training, all the layers except for the second and third layer of VGGM are kept fixed during the training of these variants. These variants generate 128-, 1024-, and 2048-dimensional feature vectors, respectively.
Table 1 summarizes the different architectures of these CNN models. We refer the reader to the relevant papers for more details.
VGG-VD is a very deep CNN network including VD16 (16 weight layers including 13 convolutional layers and 3 fully-connected layers) and VD19 (19 weight layers including 16 convolutional layers and 3 fully-connected layers). These two models were developed to investigate the effect of network depth on large-scale image recognition task. It has been demonstrated that the representations extracted by VD16 and VD19 generalize well to datasets other than those on which they were trained.
3. Deep Feature Representations for HRRSIR
In this section, we present the two proposed schemes for HRRSIR in detail. The package MatConvNet (http://www.vlfeat.org/matconvnet/) is used for the proposed schemes . The pre-trained CNN models we use are available at (http://www.vlfeat.org/matconvnet/pretrained/).
3.1. First Scheme: Features Extracted by the Pre-Trained CNN without Labelled Images
It is not possible to train an effective CNN without a sufficient, usually large number of labelled images. However, many works have shown that the features extracted by a pre-trained CNN generalize well to datasets from different domains. Therefore, in the first scheme, we regard pre-trained CNNs as feature extractors. This does not require any labelled remote sensed images for training.
The deep features are extracted directly from specific layers of the pre-trained CNN models. In order to improve performance, preprocessing steps such as data augmentation and mean subtraction are widely used. Data augmentation is a commonly used technique to augment the training samples through image cropping, flipping, and rotating, etc. Mean subtraction subtracts the average image computed over all the training samples. This speeds up the convergence of the network during training. In our work, we just conduct mean subtraction with the mean value provided by corresponding pre-trained CNN.
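A minimal sketch of the mean subtraction step is given below. It assumes, for illustration only, that the mean is available as one value per color channel; the function name `subtract_mean`, the toy image, and the mean values are hypothetical and are not the values shipped with any particular pre-trained model.

```python
def subtract_mean(image, channel_mean):
    """Subtract a per-channel mean from an H x W x C image (nested lists)."""
    return [[[pixel[ch] - channel_mean[ch] for ch in range(len(channel_mean))]
             for pixel in row]
            for row in image]


# Hypothetical 1 x 2 RGB image and hypothetical per-channel means.
image = [[[120.0, 110.0, 100.0], [130.0, 115.0, 90.0]]]
channel_mean = [123.0, 117.0, 104.0]
centered = subtract_mean(image, channel_mean)
```

If the pre-trained model instead provides a full mean image, the same subtraction would be applied pixel by pixel after resizing the input to the network's input size.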
3.1.1. Features Extracted from Fully-Connected Layers
Though there are three fully-connected layers in a pre-trained CNN model, the last layer (Fc3) is usually fed into a softmax (normalized exponential) activation function for classification. Therefore, the first two layers (Fc1 and Fc2) are used to extract features in this work, as shown in Figure 2. Fc1 and Fc2 each generate a 4096-D feature representation for all the evaluated models except for the three variants of VGGM. These 4096-D feature vectors can be directly used for computing the similarity between images for image retrieval.
3.1.2. Features Extracted from Convolutional Layers
Fc features can be considered as global features to some extent, while previous works have demonstrated that local features have better performance than global features when used for HRRSIR [9,32]. Therefore, it is important to investigate whether CNNs can generate local features and how to aggregate these local descriptors into a compact feature vector. There has been some work that investigates how to generate compact features from the activations of the fully-connected layers  and the convolutional layers .
Feature maps of the current convolutional layer are computed by sliding the filters over the output feature maps of the previous layer with a fixed stride, and thus each unit of a feature map corresponds to a local region of the image. To compute the feature representation of this local region, the units of these feature maps need to be recombined. Figure 2 illustrates the process of extracting features from the last convolutional layer (e.g., Conv5 layer in this case). The feature maps are first flattened to obtain a set of feature vectors. Each column then represents a local descriptor which can be regarded as the feature representation of the corresponding image region. Let $n$ and $m$ be the number and the size of the feature maps, respectively. The local descriptors can be defined by:
$$F = [x_1, x_2, \ldots, x_m]$$
where $x_i$ ($i = 1, \ldots, m$) is an n-dimensional feature vector representing a local descriptor.
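The recombination of feature maps into local descriptors can be sketched in a few lines of Python. The function name `feature_maps_to_descriptors` and the toy values are illustrative assumptions; the sketch only shows that a stack of feature maps, each covering the same spatial locations, yields one descriptor per location whose length equals the number of maps.

```python
def feature_maps_to_descriptors(feature_maps):
    """Turn n feature maps (each h x w) into h * w local descriptors of length n."""
    # Flatten every map row by row: the result is an n x (h * w) matrix.
    flattened = [[v for row in fmap for v in row] for fmap in feature_maps]
    # Each column of that matrix is the descriptor of one spatial location.
    return [list(column) for column in zip(*flattened)]


# Three toy 2 x 2 feature maps, i.e., 3 maps with 4 locations each.
maps = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
descriptors = feature_maps_to_descriptors(maps)
```

Here the four resulting descriptors are three-dimensional, one value from each feature map at the same location.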
The local descriptor set is of high dimension, so using it directly for similarity measure is impractical. We therefore utilize bag of visual words (BOVW), vector of locally aggregated descriptors (VLAD), and improved Fisher kernel (IFK) to aggregate these local descriptors into a compact feature vector. BOVW is extracted by quantizing local descriptors into visual words in a dictionary which is generally formed using clustering algorithms (e.g., K-means). IFK uses Gaussian mixture models to encode the local descriptors; the encoding is formed by concatenating the partial derivatives with respect to the means and variances of the Gaussian functions. VLAD is a simplification of IFK which uses the non-probabilistic K-means clustering to generate the dictionary. The differences between each local descriptor and its nearest neighbor in the dictionary are accumulated to obtain the feature vector.
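To make the difference between BOVW and VLAD concrete, the following sketch encodes a small set of local descriptors against a fixed dictionary. It is a simplified illustration under stated assumptions: the dictionary is written by hand rather than learned with K-means, the function names (`nearest_word`, `bovw`, `vlad`) are hypothetical, and the normalization steps often applied to these encodings are omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, dictionary):
    """Index of the dictionary entry (visual word) closest to the descriptor."""
    return min(range(len(dictionary)),
               key=lambda k: squared_distance(descriptor, dictionary[k]))


def bovw(descriptors, dictionary):
    """BOVW: histogram counting how many descriptors fall on each visual word."""
    histogram = [0] * len(dictionary)
    for d in descriptors:
        histogram[nearest_word(d, dictionary)] += 1
    return histogram


def vlad(descriptors, dictionary):
    """VLAD: per-word sum of residuals (descriptor - nearest word), concatenated."""
    dim = len(dictionary[0])
    residuals = [[0.0] * dim for _ in dictionary]
    for d in descriptors:
        k = nearest_word(d, dictionary)
        for i in range(dim):
            residuals[k][i] += d[i] - dictionary[k][i]
    return [v for word in residuals for v in word]


# Hand-written two-word dictionary (assumption; normally learned by K-means).
dictionary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
bovw_vector = bovw(descriptors, dictionary)  # length = number of words
vlad_vector = vlad(descriptors, dictionary)  # length = words * descriptor dim
```

BOVW keeps only the counts per visual word, whereas VLAD also records how the descriptors deviate from each word, which is why its vector is longer for the same dictionary size.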
3.2. Second Scheme: Learning Domain-Specific Features with Limited Labelled Images
Though pre-trained CNNs have been shown to generalize well to datasets from domains different from those on which they were trained, we here ask the question: can we improve the performance of the pre-trained CNN if we have limited labelled images? In the second scheme, we propose two approaches to solve this problem.
3.2.1. Features Extracted by Fine-Tuned CNNs
The first approach is to fine-tune the CNNs pre-trained on ImageNet using the target remote sensing dataset. This will adjust the trained parameters to better suit the target dataset. Figure 3 shows the flowchart of fine-tuning the pre-trained CNN on the target dataset. The weights of the pre-trained CNN can be directly transferred to the fine-tuned CNN as initialization for training.
The pre-trained and fine-tuned CNN models have the same number of convolutional and fully-connected layers but differ greatly in the number of outputs in the Fc3 layer. The last fully-connected layer (Fc3) of a CNN model is used for classification, thus the number of units in this layer is equal to the number of image classes in the dataset.
3.2.2. Features Extracted by the Novel Low Dimensional CNN (LDCNN)
In the first scheme, pre-trained CNNs are used as feature extractors to extract Fc and Conv features for HRRSIR, but these models are trained on ImageNet, which is very different from remote sensing images. In practice, a common strategy for this problem is to fine-tune the pre-trained CNNs on the target remote sensing dataset to learn domain-specific features. However, the deep features extracted from the fine-tuned Fc layers are usually 4096-D feature vectors, which are not compact enough for large-scale image retrieval. Further, the Fc layers are prone to overfitting because most of the parameters lie in Fc layers. In addition, the convolutional filters in CNN are generalized linear models (GLMs) based on the assumption that the features are linearly separable, while features that achieve good abstraction are generally highly nonlinear functions of the input.
Network in Network (NIN) , which is a stack of several mlpconv layers, has therefore been proposed to overcome these limitations. In NIN, the GLM is replaced with an mlpconv layer to enhance model discriminability and the conventional fully-connected layers are replaced with global average pooling to directly output the spatial averages of the feature maps from the last mlpconv layer which are then fed into the softmax layer for classification. We refer the reader to  for more details.
Inspired by NIN, we propose a novel CNN that has high model discriminability but generates low dimensional features. Figure 4 shows the overall structure of the low dimensional CNN (LDCNN) which consists of five conventional convolution layers, an mlpconv layer, a global average pooling layer, and a softmax classifier layer. LDCNN is essentially the combination of a conventional CNN (linear convolution layer) and an NIN (mlpconv layer). The structure design is based on the assumption that in conventional CNNs the earlier layers are trained to learn low-level features such as edges and corners that are linearly separable while the later layers are trained to learn more abstract high-level features that are nonlinearly separable. The mlpconv layer we use in this paper is a three-layer perceptron trainable by back-propagation and is able to generate one feature map for each corresponding class. The global average pooling layer computes the average of each feature map and leads to an n-dimensional feature vector (n is the number of image classes) which will be used for HRRSIR in this paper.
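The global average pooling step can be illustrated with a short sketch. The function name `global_average_pooling` and the three toy class maps are assumptions for illustration; they stand in for the per-class feature maps produced by the mlpconv layer.

```python
def global_average_pooling(feature_maps):
    """Average each 2-D feature map to a single value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Three hypothetical per-class feature maps of size 2 x 2.
class_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
feature = global_average_pooling(class_maps)
```

The length of the resulting vector equals the number of feature maps, so a network trained on a dataset with 30 classes, such as AID, would yield a 30-dimensional feature under this design.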
4. Experiments and Analysis
In this section, we evaluate the performance of the proposed schemes for HRRSIR on several publicly available remote sensing image datasets. We first introduce the datasets and experimental setup and then present the experimental results in detail.
4.1. Datasets
The University of California, Merced dataset (UCMD) (http://vision.ucmerced.edu/datasets/landuse.html) is a challenging dataset containing 21 image classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts . Each class has 100 images with the size of 256 × 256 pixels and about one foot spatial resolution. This dataset is cropped from large aerial images downloaded from the United States Geological Survey (USGS). Figure 5 shows some sample images from this dataset.
The WHU-RS19 remote sensing dataset (RSD) (http://dsp.whu.edu.cn/cn/staff/yw/HRSscene.html) is collected from Google Earth imagery and consists of 19 classes: airport, beach, bridge, commercial area, desert, farmland, football field, forest, industrial area, meadow, mountain, park, parking, pond, port, railway station, residential area, river, and viaduct . The dataset contains a total of 1005 images and each image has a fixed size of 600 × 600 pixels. The spatial resolution is up to half a meter. Figure 6 shows some sample images from this dataset.
The RSSCN7 dataset (https://www.dropbox.com/s/j80iv1a0mvhonsa/RSSCN7.zip?dl=0) consists of seven land-use classes: grassland, forest, farmland, parking lot, residential region, industrial region, and river and lake. For each class, there are 400 images with the size of 400 × 400 pixels sampled on four different scale levels from Google Earth. Figure 7 shows some sample images from this dataset.
The aerial image dataset (AID) (http://www.lmars.whu.edu.cn/xia/AID-project.html) is a large-scale publicly available dataset. It is notably larger than the three datasets mentioned above. It is collected with the goal of advancing the state-of-the-art in scene classification of remote sensing images. The dataset consists of 30 scene types: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. The dataset contains a total of 10,000 images. Each class has 220 to 420 images of size 600 × 600 pixels. This dataset is very challenging since its spatial resolution varies greatly, from about 0.5 m to 8 m. Figure 8 shows some sample images from AID.
UCMD has been widely used for image retrieval performance evaluation. However, it is relatively small and so the performance on this dataset has saturated. In contrast to UCMD, the other three datasets are more challenging due to the image scale, image size, and spatial resolution.
4.2. Experimental Setup
4.2.1. Implementation Details
The images are resized to 227 × 227 pixels for AlexNet and CaffeRef and to 224 × 224 pixels for the other networks, as these are the required input dimensions of CNNs. K-means clustering is used to construct the dictionaries for aggregating the Conv features. The dictionary sizes of BOVW, VLAD, and IFK are empirically set to 1000, 100, and 100, respectively.
Regarding the fine-tuning process, the weights of the convolutional layers and the first two fully-connected layers are transferred from the pre-trained CNN model, while the weights of the last fully-connected layer are initialized from a Gaussian distribution (with a mean of 0 and a standard deviation of 0.01).
In the case of LDCNN, the weights of the five convolutional layers are transferred from VGGM. We also tried using AlexNet, CaffeRef, VGGF, and VGGS but achieved comparable or slightly worse performance. The weights of the convolutional layers are kept fixed during training in order to speed up training with the limited number of labelled remote sensing images. More specifically, the layers after the last pooling layer of VGGM are removed and the remaining layers are preserved for LDCNN. The weights of the mlpconv layer are initialized from a Gaussian distribution (with a mean of 0 and a standard deviation of 0.01). The initial learning rate is set to 0.001 and is lowered by a factor of 10 when the accuracy on the validation set stops improving. Dropout is applied to the mlpconv layer.
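The learning-rate rule described above can be sketched as a simple schedule. The sketch assumes, purely for illustration, that "stops improving" means a single epoch whose validation accuracy does not exceed the best value seen so far; the function name `schedule_learning_rates` and the accuracy values are hypothetical.

```python
def schedule_learning_rates(val_accuracies, initial_lr=0.001, factor=10.0):
    """Return the learning rate used in each epoch.

    The rate is divided by `factor` after any epoch whose validation accuracy
    does not exceed the best accuracy seen so far (an assumed stopping rule).
    """
    lr = initial_lr
    best = float("-inf")
    rates = []
    for accuracy in val_accuracies:
        rates.append(lr)
        if accuracy > best:
            best = accuracy
        else:
            lr /= factor
    return rates


# Hypothetical validation accuracies over five epochs.
rates = schedule_learning_rates([0.60, 0.70, 0.70, 0.75, 0.74])
```

With these toy accuracies the rate stays at 0.001 for the first three epochs and drops to 0.0001 once the third epoch fails to improve on the second.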
The AID dataset is used to fine-tune the pre-trained CNNs and to train the LDCNN because it is the largest of the remote sensed datasets, having 10,000 images in total. The training data consists of 80% of the images per class selected at random. The remaining images constitute the validation data used to indicate when to stop training.
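A per-class random split of this kind can be sketched as follows. The function name `split_per_class`, the fixed seed, and the toy label list are assumptions made for illustration; the actual random selection used in the experiments is not specified here.

```python
import random


def split_per_class(labels, train_fraction=0.8, seed=0):
    """Randomly assign a fraction of each class to training, the rest to validation."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, validation = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return sorted(train), sorted(validation)


# Toy label list: 10 images of one class and 5 of another.
labels = ["airport"] * 10 + ["beach"] * 5
train_idx, val_idx = split_per_class(labels)
```

Splitting within each class keeps the class proportions of the training and validation sets the same as in the full dataset.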
In the following experiments, we perform 2100 queries for the UCMD dataset, 1005 queries for the RSD dataset, 2800 queries for the RSSCN7 dataset, and 10,000 queries for the AID dataset. Euclidean distance is used as the similarity measure, and all the feature vectors are L2 normalized before similarity measure.
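The similarity measure described above, L2 normalization followed by Euclidean distance, can be sketched as below. The function names (`l2_normalize`, `euclidean`, `rank_database`) and the toy two-dimensional features are illustrative assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_database(query, database):
    """Return database indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    normalized = [l2_normalize(v) for v in database]
    return sorted(range(len(database)), key=lambda i: euclidean(q, normalized[i]))


# Toy database of three 2-D features and one query feature.
database = [[0.0, 5.0], [3.0, 4.0], [2.0, 0.0]]
ranking = rank_database([1.0, 0.0], database)
```

Because the vectors are normalized first, the ranking depends only on the direction of the features and not on their magnitude.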
4.2.2. Performance Measures
The average normalized modified retrieval rank (ANMRR) and mean average precision (mAP) are used to evaluate the retrieval performance.
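Let $q$ be a query image with a ground truth size of $NG(q)$, and let $R(k)$ be the retrieved rank of the k-th image, which is defined as
$$R(k) = \begin{cases} R(k), & R(k) \le K(q) \\ 1.25\,K(q), & R(k) > K(q) \end{cases}$$
where $K(q) = 2NG(q)$ is used to impose a penalty on the retrieved images with a higher rank. The normalized modified retrieval rank (NMRR) is defined as
$$NMRR(q) = \frac{AR(q) - 0.5\,[1 + NG(q)]}{1.25\,K(q) - 0.5\,[1 + NG(q)]}$$
where $AR(q) = \frac{1}{NG(q)} \sum_{k=1}^{NG(q)} R(k)$ is the average rank. Then the final ANMRR can be defined as
$$ANMRR = \frac{1}{NQ} \sum_{q=1}^{NQ} NMRR(q)$$
where $NQ$ is the number of queries. ANMRR ranges from zero to one, and a lower value means better retrieval performance.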
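A minimal sketch of the ANMRR computation is given below. It follows the commonly used formulation in which ranks beyond a cutoff K(q) are replaced by 1.25 K(q), and it assumes K(q) = 2 NG(q), where NG(q) is the ground truth size; the function names `nmrr` and `anmrr` and the toy ranks are hypothetical.

```python
def nmrr(ranks, ground_truth_size):
    """Normalized modified retrieval rank for one query.

    ranks: 1-based retrieved ranks of the ground-truth images.
    Assumes the cutoff K(q) = 2 * NG(q).
    """
    ng = ground_truth_size
    k_q = 2 * ng
    # Ranks beyond the cutoff are penalized with the constant 1.25 * K(q).
    penalized = [r if r <= k_q else 1.25 * k_q for r in ranks]
    average_rank = sum(penalized) / ng
    return (average_rank - 0.5 * (1 + ng)) / (1.25 * k_q - 0.5 * (1 + ng))


def anmrr(all_ranks, ground_truth_sizes):
    """Average NMRR over a set of queries."""
    scores = [nmrr(r, ng) for r, ng in zip(all_ranks, ground_truth_sizes)]
    return sum(scores) / len(scores)


# Two toy queries with three relevant images each.
perfect = nmrr([1, 2, 3], 3)     # all relevant images ranked first
worst = nmrr([50, 60, 70], 3)    # all relevant images beyond the cutoff
average = anmrr([[1, 2, 3], [50, 60, 70]], [3, 3])
```

The two toy queries mark the ends of the scale: a perfect ranking gives zero, and a ranking in which every relevant image falls beyond the cutoff gives one.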
Given a set of $Q$ queries, mAP is defined as
$$mAP = \frac{1}{Q} \sum_{q=1}^{Q} AveP(q)$$
where AveP is the average precision defined as
$$AveP = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{\text{number of relevant images}}$$
where $P(k)$ is the precision at cutoff $k$, and $rel(k)$ is an indicator function equaling 1 if the image at rank $k$ is a relevant image and zero otherwise. $k$ and $n$ are the rank and number of the retrieved images, respectively. Note that the average is over all relevant images.
We also use precision at k (P@k) as an auxiliary performance measure in the case that ANMRR and mAP achieve opposite results. Precision is defined as the fraction of retrieved images that are relevant to the query image.
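The average precision and mAP computations can be sketched as follows. The sketch assumes that each ranked list contains every relevant image, so the number of relevant images equals the number of 1 flags in the list; the function names `average_precision` and `mean_average_precision` and the toy relevance lists are hypothetical.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance flags."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision at cutoff k, counted at relevant ranks only
    return total / num_relevant


def mean_average_precision(relevance_lists):
    """Mean of the average precision over all queries."""
    scores = [average_precision(r) for r in relevance_lists]
    return sum(scores) / len(scores)


ap_a = average_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2
ap_b = average_precision([0, 1])        # (1/2) / 1
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

Unlike ANMRR, a higher mAP indicates better retrieval, so the two measures move in opposite directions when a ranking improves.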
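Precision at k can be computed directly from the relevance flags of a ranked list, as in the following sketch. The function name `precision_at_k` and the toy list are illustrative assumptions, and the sketch assumes that at least k images are retrieved.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved images that are relevant.

    relevance: 0/1 flags of the ranked list; assumes len(relevance) >= k.
    """
    return sum(relevance[:k]) / k


# Toy ranked list: relevant images at ranks 1, 3, and 4.
flags = [1, 0, 1, 1, 0]
p_at_2 = precision_at_k(flags, 2)
p_at_4 = precision_at_k(flags, 4)
```

Evaluating the measure at several values of k shows how precision changes as more images are returned, which is what makes it useful when ANMRR and mAP disagree.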
4.3. Results of the First Scheme
4.3.1. Results of Fc Features
The results of using Fc features on the four datasets are shown in Table 2. For the UCMD dataset, the best result is obtained by using the Fc2 features of VGGM, which achieves an ANMRR value that is about 12% lower and a mAP value that is about 14% higher than that of the worst result which is achieved by VGGM128_Fc2. For the RSD dataset, the best result is obtained by using the Fc2 features of CaffeRef, which achieves an ANMRR value that is about 18% lower and a mAP value that is about 21% higher than that of the worst result, which is achieved by VGGM128_Fc2.
Note that sometimes the two performance measures ANMRR and mAP indicate “opposite” results. For example, VGGF_Fc2 and VGGM_Fc2 achieve better results on the RSSCN7 dataset than the other features in terms of ANMRR value, while VGGM_Fc1 performs the best on the RSSCN7 dataset in terms of mAP value. Such “opposite” results can also be found on the AID dataset in terms of VGGS_Fc1 and VGGS_Fc2 features. Here P@k is used to further investigate the performance of the features, as shown in Table 3. When the number of retrieved images k is 100 or smaller, we see that the Fc features of VGGM and in particular VGGM_Fc1 perform slightly better than VGGF_Fc2 in the case of the RSSCN7 dataset, and in the case of the AID dataset, VGGS_Fc1 performs slightly better than VGGS_Fc2.
It is interesting that VGGM performs better than its three variants on the four datasets, indicating that lowering the dimension of the Fc2 features does not improve the performance. However, in contrast to VGGM, these three variants have reduced storage and time cost due to the lower feature dimension. It can also be observed that Fc2 features perform better than the Fc1 features for most of the evaluated CNN models on the four datasets.
4.3.2. Results of Conv Features
Table 4 shows the results of the Conv features aggregated using BOVW, VLAD, and IFK on the four datasets. For the UCMD dataset, IFK performs better than BOVW and VLAD for all the evaluated CNN models and the best result is achieved by VD16 (ANMRR 0.407). It can also be observed that BOVW performs the worst except for VD16. This makes sense because BOVW ignores spatial information which is very important for remote sensing images when encoding local descriptors into a compact feature vector. For the other three datasets, VLAD has better performance than BOVW and IFK for most of the evaluated CNN models and the best results on the RSD, RSSCN7, and AID datasets are achieved by VD16 (ANMRR 0.342), VGGM (ANMRR 0.420) and VD16 (ANMRR 0.554), respectively. Moreover, we can see BOVW still has the worst performance among these three feature aggregation methods.
Similar to the Fc features, the Conv features also achieve some “opposite” results on the RSSCN7 and AID datasets. For example, VGGM_VLAD performs the best on the RSSCN7 dataset in terms of ANMRR value, while CaffeRef_IFK outperforms the other features in terms of mAP value. The “opposite” results can also be found on the AID dataset with respect to the VD16_VLAD and VGGS_VLAD features. Here P@k is also used to further investigate the performance of these features, as shown in Table 5. It is clear that VGGM_VLAD achieves better performance than CaffeRef_IFK when the number of retrieved images is 100 or smaller. In the case of the AID dataset, VGGS_VLAD has better performance when the number of retrieved images is 10 or smaller or is 1000.
4.3.3. Effect of ReLU on Fc and Conv Features
The Fc and Conv features can be extracted either with or without the use of the ReLU transformation. Deep features with ReLU means ReLU is applied to generate non-negative features which can improve the nonlinearity of the features, while deep features without ReLU mean the opposite. To give an extensive evaluation of deep features, the effect of ReLU on the Fc and Conv features is investigated.
Figure 9 shows the results of Fc features extracted by different CNN models with and without ReLU on the four datasets. We can see that Fc1 features (without ReLU) perform better than Fc1_ReLU features, indicating the nonlinearity does not contribute to the performance improvement as expected. However, we tend to achieve opposite results (Fc2_ReLU achieves better or similar performance compared with Fc2) in terms of Fc2 and Fc2_ReLU features except for VGGM128, VGGM1024, and VGGM2048. A possible explanation is that the Fc2 layer can generate higher-level features which are then used as input of the classification layer (Fc3 layer) and thus improving nonlinearity will benefit the final results.
Figure 10 shows the results of Conv features aggregated by BOVW, VLAD, and IFK on the four datasets. It can be observed that the use of ReLU (BOVW_ReLU, VLAD_ReLU, and IFK_ReLU) decreases the performance of the evaluated CNN models on the four datasets except for the UCMD dataset. Regarding the UCMD dataset, the use of ReLU decreases the performance in the case of BOVW but improves the performance in the case of VLAD (except for VD16 and VD19) and IFK. In addition, it is interesting to find that BOVW achieves better performance than VLAD and IFK on the UCMD dataset and IFK performs better than BOVW and VLAD on the other three datasets in terms of Conv features without the use of ReLU.
Table 6 shows the best results of Fc and Conv features on the four benchmark datasets. The models that achieve these results are also shown in the table. It is interesting to find that the very deep models (VD16 and VD19) perform worse than the relatively shallow models in terms of Fc features but perform better in terms of Conv features. These results indicate that network depth has an effect on the retrieval performance. Specifically, increasing the convolutional layer depth improves the performance of Conv features but decreases the performance of Fc features. For example, VD16 has 13 convolutional layers and 3 fully-connected layers while VGGS has 5 convolutional layers and 3 fully-connected layers, and VD16 achieves better performance on the AID dataset in terms of Conv features but worse performance in terms of Fc features. This makes sense because the Fc layers generate high-level domain-specific features while the Conv layers generate more generic features. In addition, more data and techniques are needed to train very deep models such as VD16 and VD19 successfully than to train relatively shallow models. Moreover, very deep models have higher time costs than relatively shallow models. Therefore, it is wise to balance the tradeoff between performance and efficiency in practice, especially for large-scale tasks such as image recognition and image retrieval.
We also consider combinations of the Fc and Conv features in Table 6. Three new features are generated by pairing the best features from the three sets: (Fc1, Fc1_ReLU), (Fc2, Fc2_ReLU), and (Conv, Conv_ReLU). The results of the combined features on each benchmark dataset are shown in Table 7. The combination of Fc1 and Fc2 achieves comparable or slightly better performance than the individual features, while the combination of Fc and Conv tends to decrease the performance. One possible explanation is that the combined Fc and Conv feature is of high dimension.
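The text does not spell out how two features are combined. One common choice, shown here purely as an assumption, is to L2-normalize each feature and concatenate them, which also makes the growth in dimensionality explicit.

```python
import numpy as np

def combine_features(a, b, eps=1e-12):
    """Concatenate two L2-normalized feature vectors (an assumed combination rule)."""
    a = a / (np.linalg.norm(a) + eps)
    b = b / (np.linalg.norm(b) + eps)
    return np.concatenate([a, b])

rng = np.random.default_rng(0)
fc1 = rng.standard_normal(4096)  # hypothetical Fc1 feature
fc2 = rng.standard_normal(4096)  # hypothetical Fc2 feature
fc1_fc2 = combine_features(fc1, fc2)  # 8192-D combined feature
```

Under this rule the combined vector is as long as the sum of its parts, so pairing an Fc feature with an aggregated Conv feature can produce a considerably longer vector than either alone.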
4.3.4. Comparisons with State-of-the-Art
As shown in Table 8, we compare the best result of the deep features on the UCMD dataset (0.374, achieved by Fc1 + Fc2) with several state-of-the-art methods, including local invariant features, VLAD-PQ, morphological texture, and the UCNN4 feature extracted by IRMFRCAMF. The deep features achieve remarkable performance and improve on the state-of-the-art by a significant margin.
We also compare the deep features with several state-of-the-art methods, including local binary pattern (LBP), spatial envelope model (GIST), and bag of visual words (BOVW), on the other three benchmark datasets. These approaches are widely used for remote sensing scene classification [41,44]. LBP is used to extract local texture information. In our implementation, a circular neighborhood of 8 pixels with radius 1 is used to extract a 10-D uniform rotation-invariant histogram. GIST is based on the spatial envelope model and represents the dominant spatial structure of an image. The default parameters of the original implementation are used to extract a 512-D feature vector from an image. For BOVW, we use K-means to cluster the local descriptors to form a dictionary of size K. In our experiments, scale invariant feature transform (SIFT) is used to extract local descriptors and K is set to 1500.
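The 10-D LBP histogram follows from the 8-neighbor setting: uniform rotation-invariant patterns map to the number of set bits (0 to 8), and all non-uniform patterns share one extra bin. The sketch below is a simplified NumPy version that samples the diagonal neighbors at the nearest pixel instead of interpolating, which is an assumption that differs from the original LBP formulation.

```python
import numpy as np

def lbp_riu2_histogram(gray):
    """10-bin rotation-invariant uniform LBP histogram for 8 neighbors at radius 1."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    # Eight neighbors in circular order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    bits = np.stack([
        (gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx] >= center).astype(int)
        for dy, dx in offsets
    ])  # shape (8, h-2, w-2)
    # Number of 0/1 transitions when walking once around the circle.
    transitions = np.abs(bits - np.roll(bits, 1, axis=0)).sum(axis=0)
    ones = bits.sum(axis=0)
    # Uniform patterns (at most 2 transitions) get codes 0..8; the rest share code 9.
    codes = np.where(transitions <= 2, ones, 9)
    hist = np.bincount(codes.ravel(), minlength=10).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32))  # hypothetical grayscale image
lbp_feature = lbp_riu2_histogram(image)
```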
Table 9 shows the results of these state-of-the-art approaches on the RSD, RSSCN7, and AID datasets. We can see the deep features greatly improve the performance of state-of-the-art on each benchmark dataset.
4.4. Results of the Fine-Tuned CNN and the Proposed LDCNN
The proposed LDCNN produces a 30-D feature vector, which is very compact compared to the deep features extracted by the pre-trained and fine-tuned CNNs. The pre-trained CNN that achieves the best performance in terms of Fc features on each dataset is selected and fine-tuned. Table 10 shows the results of the fine-tuned CNNs and the proposed LDCNN on the UCMD, RSD, and RSSCN7 datasets. The proposed LDCNN improves the best results of the pre-trained CNNs on the RSD and RSSCN7 datasets by 26% and 8.3%, respectively, and its performance is even better than that of the fine-tuned Fc features. However, LDCNN performs worse than both the pre-trained and the fine-tuned CNNs on the UCMD dataset. These results make sense because LDCNN is trained on the AID dataset, which is quite different from the UCMD dataset but similar to the other two datasets, in particular the RSD dataset, in terms of image size and spatial resolution. Figure 11 shows some example images of the same class taken from the four datasets. The images of UCMD are evidently at a different scale from the images of the AID dataset, while the images of RSD and RSSCN7 (especially RSD) are similar to those of the AID dataset. In addition, the fine-tuned CNN achieves only slightly better performance than the pre-trained CNN on the UCMD dataset, which also indicates that the worse performance of LDCNN on UCMD is due to the differences between the UCMD and AID datasets.
Figure 12 shows the number of parameters (weights and biases) contained in the mlpconv layer of LDCNN and in the three Fc layers of the pre-trained and fine-tuned CNNs. It is observed that LDCNN has about 2.6 times fewer parameters than the fine-tuned CNNs and 2.7 times fewer parameters than the pre-trained CNNs. This results in LDCNN being significantly easier to train than the other approaches.
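The small gap between the pre-trained and fine-tuned counts can be reproduced with simple arithmetic: the two networks differ only in the size of the Fc3 layer (1000 ImageNet classes versus the 30 AID classes). The sketch below counts weights and biases for the three Fc layers; the flattened Conv5 input size is an assumption that is not stated in this section, and the layer sizes of the LDCNN mlpconv layer are not given here, so they are not modelled.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

def stack_params(sizes):
    """Total parameters of consecutive fully-connected layers with the given sizes."""
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

# ASSUMPTION: the input to Fc1 is a flattened 6 x 6 x 512 feature map.
conv5_dim = 6 * 6 * 512

pretrained_fc = stack_params([conv5_dim, 4096, 4096, 1000])  # Fc3 has 1000 outputs
finetuned_fc = stack_params([conv5_dim, 4096, 4096, 30])     # Fc3 has N = 30 outputs

# The difference comes only from the last layer.
fc3_difference = pretrained_fc - finetuned_fc
```

Under this assumption most of the parameters sit in Fc1, which is why replacing the Fc layers changes the parameter count so strongly.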
From the extensive experiments, the two proposed schemes were proven to be effective methods for HRRSIR. Some practical observations from the experiments are summarized as follows:
- The deep feature representations extracted by the pre-trained CNNs, the fine-tuned CNNs, and the proposed LDCNN achieved superior performance to state-of-the-art hand-crafted features. The results indicate that CNN can generate powerful feature representations for HRRSIR.
- In the first scheme, the pre-trained CNNs were regarded as feature extractors to extract Fc and Conv features without any labelled images. Fc features can be directly extracted from the Fc1 and Fc2 layers, while feature aggregation methods (BOVW, VLAD, and IFK) are needed in order to utilize Conv features. To give an extensive evaluation of the deep features, we investigated the effect of ReLU on Fc and Conv features. The results show that Fc1 features achieved better performance than Fc2 features without the use of ReLU but worse performance with the use of ReLU, indicating that nonlinearity improves retrieval performance if the features are extracted from higher layers. Regarding the Conv features, it is interesting that the use of ReLU decreased the performance of Conv features on the benchmark datasets except for the UCMD dataset. A possible explanation is that these existing feature aggregation methods are designed for traditional hand-crafted features, which have quite different distributions of pairwise similarities from deep features.
- In the second scheme, we proposed two approaches to further improve the performance of the pre-trained CNNs with limited labelled images. The first approach was fine-tuning the pre-trained CNNs on the target remote sensing dataset to learn domain-specific features. For the second approach, a novel CNN that can learn low dimensional features for HRRSIR was proposed based on conventional convolution layers and a three-layer perceptron. In order to speed up training, the parameters of the convolution layers were transferred from VGGM. The proposed LDCNN was able to generate 30-D feature vectors, which are more compact than the Fc and Conv features. As shown in Table 10, LDCNN outperformed the pre-trained CNNs and the fine-tuned CNNs on the RSD and RSSCN7 datasets but achieved worse performance on the UCMD dataset. This is because the images in UCMD and AID are quite different in terms of image size and spatial resolution. LDCNN points to a way of directly learning low dimensional features from a CNN that can achieve remarkable performance.
- LDCNN was trained on the AID dataset and then applied to three new remote sensing datasets (UCMD, RSD, and RSSCN7). The remarkable performance on the RSD and RSSCN7 datasets indicates that the CNN has strong transferability. However, the performance is expected to improve further if LDCNN is trained on the target dataset. To this end, a much larger dataset needs to be constructed.
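The way a three-layer perceptron followed by global average pooling turns a convolutional feature map into a 30-D vector can be sketched as follows. This is a rough NumPy illustration under stated assumptions: the perceptron is applied independently at every spatial location, ReLU follows each layer, and the input size, hidden widths, and random weights are hypothetical placeholders rather than the trained LDCNN.

```python
import numpy as np

def mlpconv_gap(feature_map, weights, biases):
    """Apply the same small MLP at every spatial location, then average over locations."""
    h, w, c = feature_map.shape
    x = feature_map.reshape(h * w, c)  # one row per spatial location
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)  # linear layer followed by ReLU (assumed)
    return x.mean(axis=0)               # global average pooling over all locations

rng = np.random.default_rng(0)
conv_out = rng.standard_normal((13, 13, 512))  # hypothetical output of the last convolution layer
layer_sizes = [512, 64, 64, 30]                # hypothetical widths; only the 30-D output is from the text
weights = [rng.standard_normal((a, b)) * 0.01
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

ldcnn_feature = mlpconv_gap(conv_out, weights, biases)  # 30-D vector, one value per AID class
```

Because the pooling averages over locations, the output length depends only on the width of the last perceptron layer and not on the spatial size of the input.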
We presented two effective schemes to extract deep feature representations for HRRSIR. In the first scheme, the features were extracted from the fully-connected and convolutional layers of a pre-trained CNN, respectively. The Fc features could be directly used for similarity measure, while the Conv features were encoded by feature aggregation methods to generate compact feature vectors before similarity measure. We also investigated the effect of ReLU on Fc and Conv features.
Though the first scheme was able to achieve better performance than traditional hand-crafted features, the pre-trained CNN models were trained on ImageNet, which is quite different from remote sensing datasets. Fine-tuning the pre-trained CNNs on the target dataset is a common strategy to improve their performance; however, the features extracted from the fine-tuned Fc layers are 4096-D feature vectors, which are of high dimension for large-scale image retrieval. Thus, we propose a novel CNN architecture based on conventional convolution layers and a three-layer perceptron, which is then trained on a large remote sensing dataset. The proposed LDCNN is able to generate 30-D features that achieve remarkable performance on several remote sensing datasets.
While LDCNN is designed for HRRSIR, it can also be applied to other remote sensing tasks such as scene classification and object detection.
The authors would like to thank Paolo Napoletano for the code used in the performance evaluation. This work was supported by the National Key Technologies Research and Development Program (2016YFB0502603), Fundamental Research Funds for the Central Universities (2042016kf0179 and 2042016kf1019), Wuhan Chen Guang Project (2016070204010114), Special task of technical innovation in Hubei Province (2016AAA018), and the Natural Science Foundation of China (61671332).
The research idea and design was conceived by Weixun Zhou and Zhenfeng Shao. The experiments were performed by Weixun Zhou and Congmin Li. The manuscript was written by Weixun Zhou. Shawn Newsam gave many suggestions and helped revise the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
- Bretschneider, T.; Cavet, R.; Kao, O. Retrieval of remotely sensed imagery using spectral information content. In Proceedings of the IEEE International Geoscience & Remote Sensing Symposium, Toronto, ON, Canada, 24–28 June 2002; pp. 2253–2255.
- Scott, G.J.; Klaric, M.N.; Davis, C.H.; Shyu, C.R. Entropy-balanced bitmap tree for shape-based object retrieval from large-scale satellite imagery databases. IEEE Trans. Geosci. Remote Sens. 2011, 49, 1603–1616.
- Shao, Z.; Zhou, W.; Zhang, L.; Hou, J. Improved color texture descriptors for remote sensing image retrieval. J. Appl. Remote Sens. 2014, 8, 83584.
- Zhu, X.; Shao, Z. Using no-parameter statistic features for texture image retrieval. Sens. Rev. 2011, 31, 144–153.
- Aptoula, E. Remote sensing image retrieval with global morphological texture descriptors. IEEE Trans. Geosci. Remote Sens. 2014, 52, 3023–3034.
- Goncalves, H.; Corte-Real, L.; Goncalves, J.A. Automatic image registration through image segmentation and SIFT. IEEE Trans. Geosci. Remote Sens. 2011, 49, 2589–2600.
- Sedaghat, A.; Mokhtarzade, M.; Ebadi, H. Uniform robust scale-invariant feature matching for optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2011, 49, 4516–4527.
- Sirmacek, B.; Unsalan, C. Urban area detection using local feature points and spatial voting. IEEE Geosci. Remote Sens. Lett. 2010, 7, 146–150.
- Yang, Y.; Newsam, S. Geographic image retrieval using local invariant features. IEEE Trans. Geosci. Remote Sens. 2013, 51, 818–832.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
- Wan, J.; Wang, D.; Hoi, S.C.H.; Wu, P.; Zhu, J.; Zhang, Y.; Li, J. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 157–166.
- Cheriyadat, A.M. Unsupervised feature learning for aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2014, 52, 439–451.
- Zhou, W.; Shao, Z.; Diao, C.; Cheng, Q. High-resolution remote-sensing imagery retrieval using sparse features by auto-encoder. Remote Sens. Lett. 2015, 6, 775–783.
- Li, Y.; Zhang, Y.; Tao, C.; Zhu, H. Content-Based high-resolution remote sensing image retrieval via unsupervised feature learning and collaborative affinity metric fusion. Remote Sens. 2016, 8, 709.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–26 June 2009; pp. 2–9.
- Marmanis, D.; Datcu, M.; Esch, T.; Stilla, U. Deep Learning earth observation classification using Imagenet Pretrained Networks. IEEE Geosci. Remote Sens. Lett. 2015, 13, 1–5.
- Penatti, O.A.B.; Nogueira, K.; Dos Santos, J.A. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015.
- Castelluccio, M.; Poggi, G.; Sansone, C.; Verdoliva, L. Land use classification in remote sensing images by convolutional neural networks. arXiv, 2015; arXiv:1508.00092.
- Chandrasekhar, V.; Lin, J.; Morère, O.; Goh, H.; Veillard, A. A practical guide to CNNs and Fisher Vectors for image instance retrieval. Signal Process. 2016, 128, 426–439.
- Babenko, A.; Lempitsky, V. Aggregating local deep features for image retrieval. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1269–1277.
- Babenko, A.; Slesarev, A.; Chigorin, A.; Lempitsky, V. Neural Codes for Image Retrieval. In Proceedings of the Computer Vision – ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 584–599.
- Napoletano, P. Visual descriptors for content-based retrieval of remote sensing images. arXiv, 2016; arXiv:1602.00970.
- Nanni, L.; Ghidoni, S. How could a subcellular image, or a painting by Van Gogh, be similar to a great white shark or to a pizza? Pattern Recognit. Lett. 2017, 85, 1–7.
- Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical evaluation of rectified activations in convolution network. arXiv, 2015; arXiv:1505.00853.
- Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11), Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9.
- Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 675–678.
- Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of the British Machine Vision Conference, Nottinghamshire, UK, 1–5 September 2014; pp. 1–11.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556.
- Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012; arXiv:1207.0580.
- Vedaldi, A.; Lenc, K. MatConvNet. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 689–692.
- Özkan, S.; Ateş, T.; Tola, E.; Soysal, M.; Esen, E. Performance analysis of state-of-the-art representation methods for geographical image retrieval and categorization. IEEE Geosci. Remote Sens. Lett. 2014, 11, 1996–2000.
- Gong, Y.; Wang, L.; Guo, R.; Lazebnik, S. Multi-scale orderless pooling of deep convolutional activation features. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 392–407.
- Ng, J.Y.; Yang, F.; Davis, L.S. Exploiting Local Features from Deep Networks for Image Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 53–61.
- Sivic, J.; Zisserman, A. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 1470–1477.
- Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact representation. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3304–3311.
- Perronnin, F.; Sánchez, J.; Mensink, T. Improving the Fisher Kernel for Large-Scale Image Classification. In Proceedings of the European Conference on Computer Vision, Heraklion, Greece, 5–11 September 2010; pp. 143–156.
- Lin, M.; Chen, Q.; Yan, S. Network in Network. arXiv, 2014; arXiv:1312.4400.
- Xia, G.-S.; Yang, W.; Delon, J.; Gousseau, Y.; Sun, H.; Maître, H. Structural high-resolution satellite image indexing. In Proceedings of the ISPRS TC VII Symposium-100 Years ISPRS, Vienna, Austria, 5–7 July 2010; pp. 298–303.
- Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep learning based feature selection for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
- Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L. AID: A Benchmark dataset for performance evaluation of aerial scene classification. arXiv, 2016; arXiv:1608.05167.
- Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
- Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175.
- Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. arXiv, 2017; arXiv:1703.00121.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
Figure 1. The typical architecture of convolutional neural networks (CNNs). The rectified linear units (ReLU) layers are ignored here for conciseness.
Figure 2. Flowchart of the first scheme: deep features extracted from Fc2 and Conv5 layers of the pre-trained CNN model. For conciseness, we refer to features extracted from Fc1–2 and Conv1–5 layers as Fc features (Fc1, Fc2) and Conv features (Conv1, Conv2, Conv3, Conv4, Conv5), respectively.
Figure 3. Flowchart of extracting features from the fine-tuned layers. Dropout1 and dropout2 are dropout layers which are used to control overfitting. N is the number of image classes in the target dataset.
Figure 4. The overall structure of the proposed, novel CNN architecture. There are five linear convolution layers and an mlpconv layer followed by a global average pooling layer.
Figure 5. Sample images from the University of California, Merced dataset (UCMD). From the top left to bottom right: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts.
Figure 6. Sample images from the remote sensing dataset (RSD). From the top left to bottom right: airport, beach, bridge, commercial area, desert, farmland, football field, forest, industrial area, meadow, mountain, park, parking, pond, port, railway station, residential area, river, and viaduct.
Figure 7. Sample images from the RSSCN7 dataset. From left to right: grass, field, industry, lake, resident, and parking.
Figure 8. Sample images from the aerial image dataset (AID). From the top left to bottom right: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct.
Figure 9. The effect of ReLU on Fc1 and Fc2 features. (a) Results on UCMD dataset; (b) Results on RSD dataset; (c) Results on RSSCN7 dataset; (d) Results on AID dataset. For Fc1_ReLU and Fc2_ReLU features, ReLU is applied to the extracted Fc features.
Figure 10. The effect of ReLU on Conv features. (a) Results on UCMD dataset; (b) Results on RSD dataset; (c) Results on RSSCN7 dataset; (d) Results on AID dataset. For BOVW_ReLU, VLAD_ReLU, and IFK_ReLU features, ReLU is applied to the Conv features before feature aggregation.
Figure 11. Comparison between images of the same class from the four datasets. From left to right, the images are from UCMD, RSSCN7, RSD, and AID respectively.
Figure 12. The number of parameters contained in VGGM, fine-tuned VGGM, and LDCNN.
Table 1. The architectures of the evaluated CNN Models. Conv1–5 are five convolutional layers and Fc1–3 are three fully-connected layers. For each of the convolutional layers, the first row specifies the number of filters and corresponding filter size as “size × size × number”; the second row indicates the convolution stride; and the last row indicates if Local Response Normalization (LRN) is used. For each of the fully-connected layers, its dimensionality is provided. In addition, dropout is applied to Fc1 and Fc2 to overcome overfitting.
|Models||Conv1||Conv2||Conv3||Conv4||Conv5||Fc1||Fc2||Fc3|
|AlexNet||11 × 11 × 96||5 × 5 × 256||3 × 3 × 384||3 × 3 × 384||3 × 3 × 256||4096 dropout||4096 dropout||1000 softmax|
| ||stride 4||stride 1||stride 1||stride 1||stride 1|
|CaffeRef||11 × 11 × 96||5 × 5 × 256||3 × 3 × 384||3 × 3 × 384||3 × 3 × 256||4096 dropout||4096 dropout||1000 softmax|
| ||stride 4||stride 1||stride 1||stride 1||stride 1|
|VGGF||11 × 11 × 64||5 × 5 × 256||3 × 3 × 256||3 × 3 × 256||3 × 3 × 256||4096 dropout||4096 dropout||1000 softmax|
| ||stride 4||stride 1||stride 1||stride 1||stride 1|
|VGGM||7 × 7 × 96||5 × 5 × 256||3 × 3 × 512||3 × 3 × 512||3 × 3 × 512||4096 dropout||4096 dropout||1000 softmax|
| ||stride 2||stride 2||stride 1||stride 1||stride 1|
|VGGM-128||7 × 7 × 96||5 × 5 × 256||3 × 3 × 512||3 × 3 × 512||3 × 3 × 512||4096 dropout||128 dropout||1000 softmax|
| ||stride 2||stride 2||stride 1||stride 1||stride 1|
|VGGM-1024||7 × 7 × 96||5 × 5 × 256||3 × 3 × 512||3 × 3 × 512||3 × 3 × 512||4096 dropout||1024 dropout||1000 softmax|
| ||stride 2||stride 2||stride 1||stride 1||stride 1|
|VGGM-2048||7 × 7 × 96||5 × 5 × 256||3 × 3 × 512||3 × 3 × 512||3 × 3 × 512||4096 dropout||2048 dropout||1000 softmax|
| ||stride 2||stride 2||stride 1||stride 1||stride 1|
|VGGS||7 × 7 × 96||5 × 5 × 256||3 × 3 × 512||3 × 3 × 512||3 × 3 × 512||4096 dropout||4096 dropout||1000 softmax|
| ||stride 2||stride 1||stride 1||stride 1||stride 1|
Table 2. The performances of Fc features (ReLU is used) extracted by different CNN models on the four datasets. For average normalized modified retrieval rank (ANMRR), lower values indicate better performance, while for mean average precision (mAP), larger is better. The best result for each dataset is reported in bold.
Table 3. The precision at k (P@k) values of Fc features that achieve inconsistent results on the RSSCN7 and AID datasets for the ANMRR and mAP measures.
Table 4. The performance of the Conv features (ReLU is used) aggregated using bag of visual words (BOVW), vector of locally aggregated descriptors (VLAD), and improved fisher kernel (IFK) on the four datasets. For ANMRR, lower values indicate better performance, while for mAP, larger is better. The best result for each dataset is reported in bold.
Table 5. The P@k values of Conv features that achieve inconsistent results on the RSSCN7 and AID datasets for the ANMRR and mAP measures.
Table 6. The models that achieve the best results in terms of Fc and Conv features on each dataset. The numbers are ANMRR values.
Table 7. The results of the combined features on each dataset. For the UCMD dataset, Fc1, Fc2_ReLU, and Conv_ReLU features are selected; for the RSD dataset, Fc1, Fc2_ReLU, and Conv features are selected; for the RSSCN7 dataset, Fc1, Fc2_ReLU, and Conv features are selected; for the AID dataset, Fc1, Fc2_ReLU, and Conv features are selected. “+” means the combination of two features. The numbers are ANMRR values.
|Features||UCMD||RSD||RSSCN7||AID|
|Fc1 + Fc2||0.374||0.286||0.407||0.517|
|Fc1 + Conv||0.375||0.286||0.408||0.518|
|Fc2 + Conv||0.378||0.283||0.433||0.523|
Table 8. Comparisons of deep features with state-of-the-art methods on the UCMD dataset. The numbers are ANMRR values.
|Deep Features||Local Features ||VLAD-PQ ||Morphological Texture ||IRMFRCAMF |
Table 9. Comparisons of deep features with state-of-the-art methods on RSD, RSSCN7, and AID datasets. The numbers are ANMRR values.
Table 10. Results of the pre-trained CNN, the fine-tuned CNN, and the proposed LDCNN on the UCMD, RSD, and RSSCN7 datasets. For the pre-trained CNN, we choose the best results, as shown in Table 6 and Table 7; for the fine-tuned CNN, VGGM, CaffeRef, and VGGS are fine-tuned on AID and then applied to UCMD, RSD, and RSSCN7, respectively. The numbers are ANMRR values.
|Datasets||Pre-Trained CNN||Fine-Tuned CNN||LDCNN|
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).