In this section, we first introduce dimensionality reduction methods in detail and then comprehensively review existing research on CNN-based high-resolution RSIR.
2.1. Dimensionality Reduction Method
Dimensionality reduction is used to map high-dimensional data to a low-dimensional space [4]. According to the type of mapping function, dimensionality reduction algorithms can be divided into linear and nonlinear methods. Linear dimensionality reduction constructs a linear function from the sample set to realize the mapping from the high-dimensional to the low-dimensional space; representative methods such as principal component analysis (PCA) [4], linear discriminant analysis (LDA) [5] and multidimensional scaling (MDS) [6] have been widely used in various image retrieval schemes.
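As an illustration, a minimal PCA projection can be sketched with NumPy via the singular value decomposition of the centered data matrix (the data and target dimensionality here are purely illustrative):

```python
import numpy as np

def pca_project(X, k):
    """Project n samples (rows of X) onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)          # center each feature
    # Right singular vectors of the centered data are the principal axes,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T             # (n, k) low-dimensional embedding

# Toy example: 100 samples in 10-D reduced to 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Z = pca_project(X, 2)
print(Z.shape)  # (100, 2)
```

Because the singular vectors are sorted by singular value, the first output dimension always carries the most variance.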
Nonlinear dimensionality reduction methods instead construct a nonlinear map, which tends to perform better when the data lie on a nonlinear manifold. Representative methods include locally linear embedding (LLE) [7], locality preserving projection (LPP) [8], stochastic neighbor embedding (SNE) [2], t-distributed stochastic neighbor embedding (t-SNE) [3] and LargeVis [9]. Among them, SNE, t-SNE and LargeVis are designed for the visualization of large-scale high-dimensional data. Zhang et al. [10] utilized a t-SNE-based nonlinear manifold hashing algorithm to reduce dimensionality by learning compact binary codes embedded on the intrinsic manifolds of deep spectral-spatial features, balancing learning efficiency and retrieval accuracy.
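For intuition, the neighborhood affinities that SNE and t-SNE build in the high-dimensional space can be sketched as follows; the Gaussian bandwidth is fixed here for simplicity, whereas real implementations tune it per point via a perplexity target:

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Conditional probabilities p(j|i) from pairwise Gaussian similarities,
    as used by SNE/t-SNE in the high-dimensional space (fixed bandwidth)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)  # each row sums to 1

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
P = gaussian_affinities(X)
print(P.sum(axis=1))  # each row sums to 1
```

t-SNE then places points in the low-dimensional space so that a heavy-tailed (Student-t) affinity distribution matches these probabilities, which is what pulls same-class points together and pushes classes apart.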
Compared with t-SNE, LargeVis achieves an equivalent or even better dimensionality reduction effect. Both t-SNE and LargeVis effectively increase the distance between classes that are far apart in the high-dimensional space while decreasing the distance within each class. In addition, the computational complexity of LargeVis is much lower than that of t-SNE. However, two problems need to be solved before LargeVis can be applied to image retrieval.
First, LargeVis relies on the distance relationships among a set of data points and therefore cannot reduce the dimensionality of the high-dimensional features of a single image, so it must be extended to meet the requirements of image retrieval.
Second, LargeVis has a high degree of randomness, which yields different dimensionality reduction results across runs and is unfavorable for image retrieval; this randomness must be eliminated while preserving the clustering characteristics of LargeVis.
2.2. High-Resolution Remote Sensing Image Retrieval Based on CNN
In recent years, deep learning has made tremendous breakthroughs in many fields, such as speech recognition, natural language processing, and computer vision. The best-known deep neural network, the convolutional neural network (CNN), adopts deep hierarchical architectures whose layer parameters are learned from large labeled classification datasets [11]. Deep features are more robust, discriminative and representative than hand-crafted features, especially for extracting image context information.
As mentioned above, CBIR is an image retrieval framework proposed in the 1990s that comprises image feature extraction and similarity measurement. Extracting representative and discriminative image features is crucial for CBIR [12]. Recently, researchers have applied CNNs to extract deep features for image retrieval and achieved much better performance than traditional methods [13]; this has become the mainstream solution for high-resolution RSIR [14]. The overall framework is shown in Figure 1.
As Figure 1 shows, CNN-based deep features mainly include two types: convolutional layer features and fully connected layer features. Convolutional layer features contain more detail, coming from the low levels of the CNN, while fully connected layer features focus more on semantics, coming from the high levels of the CNN.
● Convolutional layer features
The outputs of the CNN convolutional layers are feature maps obtained by convolving the image with convolution kernels of various sizes and parameters. Since different kernels differ in their ability to describe image features, the CNN can obtain a richer image representation. However, a convolutional feature map cannot be used directly as an image descriptor; for retrieval or classification it usually needs to be compactly represented through an encoding or pooling operation.
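For example, a C×H×W convolutional feature map can be collapsed into a C-dimensional descriptor by pooling over the spatial dimensions (a generic sketch not tied to any particular network; the feature map here is random data standing in for a real CNN output):

```python
import numpy as np

def pool_feature_map(fmap, mode="max"):
    """Collapse a (C, H, W) convolutional feature map into a C-dim descriptor."""
    if mode == "max":
        desc = fmap.max(axis=(1, 2))     # strongest activation per channel
    else:
        desc = fmap.mean(axis=(1, 2))    # average activation per channel
    return desc / (np.linalg.norm(desc) + 1e-12)  # L2-normalize for retrieval

rng = np.random.default_rng(0)
fmap = rng.random((512, 7, 7))           # e.g. a last-conv-layer output shape
d_max = pool_feature_map(fmap, "max")
d_avg = pool_feature_map(fmap, "avg")
print(d_max.shape)  # (512,)
```

The L2 normalization makes descriptors comparable under cosine or Euclidean distance, which is the usual setup in retrieval.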
Zhou et al. [15], Hu et al. [16] and Xia et al. [17] systematically compared the retrieval performance of CNN convolutional layer features and fully connected layer features. In their experiments, AlexNet [11], VGGNet [18], and GoogLeNet [19] were used as backbone networks to extract convolutional layer features. Then, bag-of-words (BoW) [20], the improved Fisher kernel (IFK) [21] and the vector of locally aggregated descriptors (VLAD) [22], among others, were used to encode the convolutional layer features. Finally, various pooling methods, including max pooling, average pooling, hybrid pooling, SPoC [23] and CroW [24], were compared and analyzed. The experimental results showed that encoding the convolutional layer features yields better retrieval performance than pooling, and that convolutional layer features outperform fully connected layer features.
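As an illustration of cross-dimensional weighting in the style of CroW, spatial locations can be weighted by their aggregate activation before sum pooling. This is a simplified sketch under stated assumptions: the full method also applies per-channel sparsity weights, which are omitted here.

```python
import numpy as np

def crow_like_descriptor(fmap):
    """Weighted sum pooling of a (C, H, W) feature map: each spatial
    location is weighted by its total activation across channels
    (a simplified CroW-style scheme without channel weighting)."""
    S = fmap.sum(axis=0)                       # (H, W) aggregate activation
    S = np.sqrt(S / (S.sum() + 1e-12))         # normalized spatial weights
    desc = (fmap * S[None, :, :]).sum(axis=(1, 2))
    return desc / (np.linalg.norm(desc) + 1e-12)

rng = np.random.default_rng(0)
fmap = rng.random((256, 14, 14))               # stand-in for a conv output
d = crow_like_descriptor(fmap)
print(d.shape)  # (256,)
```

The intuition is that locations activating many channels at once tend to cover salient objects and should contribute more to the global descriptor.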
Wang et al. [25] proposed an image retrieval method based on bilinear pooling, in which the ImageNet dataset [26] was used to pre-train VGG-16 [18] and ResNet34 [27], and the convolutional layer features of the two networks were weighted by channel and spatial attention mechanisms to assign higher weights to the feature channels useful for retrieval. Deep feature vectors were obtained by fusing the last convolutional layer features of the two networks with bilinear pooling, and PCA was then adopted to reduce the dimensionality of the deep features for image retrieval. The experimental results show that this method achieves better retrieval performance than other pooling methods.
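The bilinear pooling step can be sketched as the per-location outer product of two networks' feature maps, averaged over spatial positions. The shapes below are illustrative, and the attention weighting of the cited method is omitted; the signed square-root and L2 normalization are the common post-processing for bilinear features.

```python
import numpy as np

def bilinear_pool(fmap_a, fmap_b):
    """Bilinear pooling of two (C, H, W) feature maps with the same spatial
    size: average the outer products over locations, then apply the signed
    square-root and L2 normalization."""
    Ca, H, W = fmap_a.shape
    Cb = fmap_b.shape[0]
    A = fmap_a.reshape(Ca, H * W)
    B = fmap_b.reshape(Cb, H * W)
    bilinear = (A @ B.T) / (H * W)             # (Ca, Cb) pooled outer product
    v = bilinear.flatten()
    v = np.sign(v) * np.sqrt(np.abs(v))        # signed square-root
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
fa = rng.random((64, 7, 7))                    # stand-in for network A features
fb = rng.random((32, 7, 7))                    # stand-in for network B features
v = bilinear_pool(fa, fb)
print(v.shape)  # (2048,)
```

Note that the output dimensionality is the product of the two channel counts, which is why the cited method follows bilinear pooling with PCA.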
● Fully connected layer features
Fully connected layer features represent the global information of the image, reflecting its semantics. Napoletano [28] thoroughly explored the impact of network training strategies on retrieval performance and found that, even with only a pre-trained CNN, the retrieval performance of fully connected layer features on remote sensing images significantly exceeded that of hand-crafted features, and that the fully connected layer features achieved optimal performance under a “pre-training + fine-tuning” strategy.
However, the dimensionality of fully connected layer features is often relatively high, leading to the “curse of dimensionality”. To address this, Xiao et al. [29] proposed a deep compact code (DCC) method to extract low-dimensional CNN features, in which the second fully connected layer of the AlexNet and VGGNet networks is utilized to obtain low-dimensional features. Compared with standard CNN features and low-dimensional features obtained by PCA, the features obtained by DCC can effectively improve RSIR performance, especially at 64 dimensions.
Fully connected layer features mainly contain semantic information and lack the local details and positional information of the image. Hu et al. [16] and Xia et al. [17] proposed a fully connected layer feature extraction method based on multiple blocks or regions: the image is first divided into blocks, and the fully connected layer features of each block are extracted separately and concatenated. These features were then aggregated using maximum, mean, and mixed pooling, and PCA was used to generate low-dimensional features for image retrieval. The experimental results show that the block-based extraction method can supply the missing positional information and, compared with extracting fully connected layer features from the whole image, effectively improves retrieval performance.
Li et al. [30] proposed a fully connected layer feature extraction method based on regions of interest (ROIs). First, the ROIs of the image were determined; then, the fully connected layer features of the ROIs were extracted and further encoded by VLAD; finally, PCA was used to reduce the feature dimensionality. The experimental results show that this method achieves higher retrieval performance than methods that extract fully connected layer features from the whole image.
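VLAD encoding aggregates, for each codebook word, the residuals of the local descriptors assigned to it. A minimal sketch follows; the descriptors and the codebook are random stand-ins, whereas in practice the codebook is learned by k-means on training descriptors:

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """VLAD: for each of K centroids, sum the residuals of the local
    descriptors assigned to it, then flatten and L2-normalize.
    descriptors: (N, D) local features; centroids: (K, D) codebook."""
    K, D = centroids.shape
    # Hard-assign each descriptor to its nearest centroid.
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :],
                           axis=-1)
    assign = dists.argmin(axis=1)
    V = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            V[k] = (members - centroids[k]).sum(axis=0)  # residual sum
    v = V.flatten()                          # K*D-dimensional code
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
desc = rng.random((100, 8))        # 100 local descriptors of dimension 8
cb = rng.random((16, 8))           # hypothetical 16-word codebook
v = vlad_encode(desc, cb)
print(v.shape)  # (128,)
```

The resulting K·D-dimensional code is why the cited methods follow VLAD with PCA before retrieval.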
Another core part of CBIR is similarity measurement, in which distance measurement is the most commonly used approach. Many researchers have studied learning-based distance measurement, which learns an embedding space in which similar features are closer together and dissimilar features are farther apart. Ye et al. [31] used the similarity of image classes to sort the CNN feature distances between the query image and each retrieved image in ascending order to obtain an initial retrieval result. The initial results were then reordered by a weight computed from the query image and each class according to the initial retrieval results; the retrieval performance is superior to state-of-the-art methods. Cao et al. [32] proposed a triplet network that outputs normalized feature vectors for an image and its positive and negative samples; the distances between these feature vectors are used to compute the loss, pulling the anchor closer to positive samples and pushing it away from negative samples. The final retrieval performance is significantly better than that of existing methods. Moreover, Zhang et al. [33] introduced relevance feedback based on feature weighting to further improve retrieval accuracy.
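The triplet objective used by such metric-learning approaches can be sketched as a hinge loss on normalized feature distances (the margin value and toy vectors are illustrative only):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on L2-normalized feature vectors: push the
    anchor-positive distance below the anchor-negative distance by a margin."""
    a = anchor / np.linalg.norm(anchor)
    p = positive / np.linalg.norm(positive)
    n = negative / np.linalg.norm(negative)
    d_ap = np.linalg.norm(a - p)             # distance to the positive sample
    d_an = np.linalg.norm(a - n)             # distance to the negative sample
    return max(0.0, d_ap - d_an + margin)

# An "easy" triplet (negative already far away) yields zero loss.
loss = triplet_loss(np.array([1.0, 0.0]), np.array([1.0, 0.1]),
                    np.array([0.0, 1.0]))
print(loss)  # 0.0
```

Training on such triplets drives the network toward an embedding where same-class images cluster and different-class images separate, which is exactly the property retrieval needs.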