Low Dimensional Discriminative Representation of Fully Connected Layer Features Using Extended LargeVis Method for High-Resolution Remote Sensing Image Retrieval

Recently, there have been rapid advances in high-resolution remote sensing image retrieval, which plays an important role in remote sensing data management and utilization. For content-based remote sensing image retrieval, low-dimensional, representative and discriminative features are essential to ensure good retrieval accuracy and speed. Dimensionality reduction is one of the important ways to improve feature quality in image retrieval, and LargeVis is an effective dimensionality reduction algorithm originally designed for big data visualization. Here, an extended LargeVis (E-LargeVis) dimensionality reduction method for high-resolution remote sensing image retrieval is proposed. E-LargeVis can reduce the dimensionality of a single high-dimensional sample by modeling the implicit mapping between LargeVis's high-dimensional input and low-dimensional output with support vector regression. On this basis, an effective high-resolution remote sensing image retrieval method is proposed to obtain more representative and discriminative deep features. First, the fully connected layer features are extracted using a channel attention-based ResNet50 as the backbone network. Then, E-LargeVis is used to reduce the dimensionality of the fully connected features to obtain a low-dimensional discriminative representation. Finally, the L2 distance is computed for similarity measurement to realize the retrieval of high-resolution remote sensing images. Experimental results on four high-resolution remote sensing image datasets, including UCM, RS19, RSSCN7, and AID, show that for various convolutional neural network architectures, the proposed E-LargeVis effectively improves retrieval performance, far exceeding other dimensionality reduction methods.


Introduction
With the rapid development of high-resolution remote sensing and ground observation technology over recent years, the quantity of remote sensing imagery data has increased exponentially. The inability to quickly browse and efficiently find required images from large-scale remote sensing archives has become a bottleneck and causes problems for remote sensing information management and sharing, which directly influence the utilization of remote sensing data. Content-based image retrieval (CBIR) [1], proposed in the 1990s as a mainstream retrieval solution, has gradually developed and has been widely applied in high-resolution remote sensing image retrieval (RSIR). CBIR includes two essential components: feature extraction and similarity measurement, in which image content is represented as image features, and the retrieval results are obtained by measuring feature similarity.

However, LargeVis cannot be applied to image retrieval directly, for two reasons:

1. LargeVis uses the distance relationships among data points to reduce dimensionality, and therefore cannot reduce the dimensionality of the high-dimensional features of a single image; it is necessary to extend LargeVis to meet the requirements of image retrieval.

2. LargeVis has a high degree of randomness, so repeated runs yield different dimensionality reduction results, which is unfavorable for image retrieval. It is necessary to eliminate this randomness while retaining the clustering characteristics of the LargeVis output.

High-Resolution Remote Sensing Image Retrieval Based on CNN
In recent years, deep learning has made tremendous breakthroughs in many fields, such as speech recognition, natural language processing, and computer vision. The most famous deep neural network, the convolutional neural network (CNN), adopts a deep hierarchical architecture whose per-layer parameters are learned from large labeled classification datasets [11]. Deep features are more robust, discriminative and representative than hand-crafted features, especially for extracting image context information.
As mentioned above, CBIR is an image retrieval framework proposed in the 1990s that includes image feature extraction and similarity measurement. It is very important for CBIR to extract representative and discriminative image features [12]. Recently, researchers have applied CNNs to extract deep features for image retrieval and achieved much better performance than traditional methods [13]; this has become a mainstream solution for high-resolution RSIR [14]. The overall framework is shown in Figure 1. As Figure 1 shows, CNN-based deep features mainly include two types: convolutional layer features and fully connected layer features. The convolutional layer features, which come from the low levels of the CNN, contain more details, while the fully connected layer features, which come from the high levels of the CNN, focus more on semantics.

Convolutional layer features
The output of the CNN convolutional layers is a set of feature maps obtained by convolving the image with convolution kernels of various sizes and parameters. Since different convolution kernels differ in their ability to describe image features, a CNN can obtain a rich image feature representation. However, the feature maps of a convolutional layer cannot be used directly as an image descriptor; they usually need to be compactly represented through a coding or pooling operation before being applied to retrieval or classification. Zhou et al. [15], Hu et al. [16] and Xia et al. [17] systematically conducted comparative experiments on retrieval performance using CNN convolutional layer features and fully connected layer features. In these experiments, AlexNet [11], VGGNet [18], and GoogLeNet [19] were used as CNN backbone networks to extract convolutional layer features. Then, bag-of-words (BoW) [20], the improved Fisher kernel (IFK) [21] and the vector of locally aggregated descriptors (VLAD) [22], among others, were used to encode the convolutional layer features. Finally, various pooling methods, including max pooling, average pooling, hybrid pooling, SPoC [23] and CroW [24], were compared and analyzed. The experimental results showed that encoding the convolutional layer features obtains better retrieval performance than pooling, and that the convolutional layer features outperformed the fully connected layer features.
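The pooling operations compared above can be illustrated with a minimal NumPy sketch; the feature-map shape and values here are toy stand-ins for real CNN activations, not data from the cited experiments.

```python
import numpy as np

def max_pool_descriptor(feature_maps):
    # feature_maps: (C, H, W) convolutional-layer output; keep one value per channel.
    return feature_maps.max(axis=(1, 2))

def avg_pool_descriptor(feature_maps):
    # Average pooling produces a smoother per-channel descriptor.
    return feature_maps.mean(axis=(1, 2))

# Toy feature map: 4 channels of 8 x 8 activations.
fmap = np.arange(4 * 8 * 8, dtype=float).reshape(4, 8, 8)
print(max_pool_descriptor(fmap).shape)  # (4,)
```

Either pooling collapses each H × W map to a scalar, so the descriptor length equals the number of channels, which is what makes pooled convolutional features compact enough for distance-based retrieval.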
Wang et al. [25] proposed an image retrieval method based on bilinear pooling, in which the ImageNet dataset [26] was used to pre-train VGG-16 [18] and ResNet34 [27] networks, and the convolutional layer features of the two networks were weighted by channel and spatial attention mechanisms to assign higher weights to the feature channels useful for the retrieval task. The deep feature vectors were obtained by fusing the last convolutional layer features of the two networks with bilinear pooling. Finally, PCA was adopted to reduce the dimensionality of the deep features for image retrieval. The experimental results show that this method achieves better retrieval performance than other pooling methods.

Fully connected layer features
Fully connected layer features represent the global information of the image, reflecting its semantic content. Napoletano [28] thoroughly explored the impact of network training strategies on retrieval performance and found that, with only a pre-trained CNN, the retrieval performance of fully connected layer features on remote sensing images already significantly exceeded that of hand-crafted features, and that the extracted fully connected layer features achieved optimal performance under a "pre-training + fine-tuning" strategy.
However, the dimensionality of the fully connected layer features is often relatively high, leading to the problem of the "curse of dimensionality". As a result, Xiao et al. [29] proposed a deep compact code (DCC) method to extract low-dimensional CNN features, in which the second fully connected layer of AlexNet and VGGNet networks is utilized to obtain low-dimensional features. Compared with usual CNN features and low-dimensional features obtained by PCA, the low-dimensional features obtained by DCC can effectively improve the performance of RSIR, especially in 64 dimensions.
The fully connected layer features mainly contain semantic information and lack the local details and positional information of the image. Hu et al. [16] and Xia et al. [17] proposed fully connected layer feature extraction methods based on multiple blocks or regions, in which the image is divided into blocks and the fully connected layer features of each block are extracted separately and cascaded. These features were then aggregated using maximum, mean, and mixed pooling, and PCA was used to generate low-dimensional features for image retrieval. The experimental results show that block-based extraction of fully connected layer features compensates for the missing positional information and, compared with extracting fully connected layer features from the whole image, effectively improves retrieval performance.
Li et al. [30] proposed a feature extraction method for fully connected layers based on regions of interest (ROIs). First, the ROIs of the image were determined. Then, the fully connected layer features of the ROIs were extracted and further encoded by VLAD; finally, PCA was used to reduce the dimensionality of the features. The experimental results show that this method can obtain higher retrieval performance than the methods of extracting fully connected layer features from the whole image.
Another core part of CBIR is similarity measurement, in which distance measurement is the most commonly used method. To date, many researchers have worked on learning-based distance measurement, which learns an embedding space in which similar features are closer together and dissimilar features are further apart. Ye et al. [31] used the similarity of image classes to sort the CNN feature distances between the query image and each retrieved image in ascending order to obtain an initial retrieval result; the initial results are then reordered by a weight calculated from the query image and each class according to the initial retrieval results. The retrieval performance is superior to state-of-the-art methods. Cao et al. [32] proposed a triplet network that outputs normalized feature vectors for the query image and its positive and negative samples; the distances between the feature vectors are used to calculate the loss, pulling the embedding closer to positive samples and pushing it away from negative samples. The final retrieval performance is significantly better than existing methods. Moreover, Zhang et al. [33] introduced relevance feedback based on feature weighting to further improve retrieval accuracy: first, the image representation is obtained from convolutional layer features or fully connected layer features; then, the retrieval results are ranked by similarity measurement.
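The triplet idea in Cao et al. [32] — normalize the embeddings, then pull the anchor toward the positive sample and away from the negative — can be sketched as a loss computation. The margin value and toy vectors below are illustrative, not the authors' settings.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on the gap between the anchor-positive and anchor-negative distances.
    a, p, n = (l2_normalize(x) for x in (anchor, positive, negative))
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(0.0, d_ap - d_an + margin)

# A well-separated triplet incurs no loss.
print(triplet_loss(np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])))  # 0.0
```

Minimizing this loss over many triplets is what shapes the embedding space so that plain L2 distance becomes a meaningful similarity measure at retrieval time.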


The Proposed E-LargeVis Method
In this section, we first review the principle of the LargeVis dimensionality reduction method and then introduce the extended LargeVis method in our work.

Principle of LargeVis Dimensionality Reduction Method
Although t-SNE and its improved variants have been widely used, they have two shortcomings: (1) when processing large-scale high-dimensional data, t-SNE and its variants have low computational efficiency; (2) t-SNE generalizes poorly: parameters tuned on one dataset cannot be applied to other datasets, and re-tuning takes a lot of time. Tang et al. [9] improved on t-SNE and proposed the LargeVis dimensionality reduction algorithm. The main improvements include an efficient kNN graph construction algorithm, a low-dimensional space visualization algorithm and a new objective function.
Constructing an exact kNN graph requires extremely high computational complexity because the distance between every pair of data points must be calculated. LargeVis instead adopts an approximate neighbor search method in two steps. (1) The space is partitioned using random projection trees, and the k nearest neighbors of each point are found within its partition, yielding an initial kNN graph that need not be completely accurate; this speeds up the calculation of the probability values of the sample points. (2) The graph is refined by neighbor exploring: for each point, potential neighbors are gathered from its current neighbors and its neighbors' neighbors, the distances from these candidates to the current point are computed and placed into a min-heap ("root pile"), and the k nodes with the smallest distances are kept as the k nearest neighbors. This yields an accurate kNN graph.
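Step (2), neighbor exploring with a min-heap, can be sketched as follows. The initial approximate graph here is a hand-made toy, not the output of a random projection tree, and a single refinement round is shown.

```python
import heapq
import numpy as np

def refine_knn(points, approx_knn, k):
    """One round of neighbor exploring: for each point, examine its current
    neighbors and its neighbors' neighbors, keeping the k closest candidates."""
    refined = {}
    for i in range(len(points)):
        candidates = set(approx_knn[i])
        for j in approx_knn[i]:
            candidates.update(approx_knn[j])   # a neighbor's neighbor is a candidate
        candidates.discard(i)
        # Min-heap selection on distance: keep the k smallest.
        dists = [(np.linalg.norm(points[i] - points[c]), c) for c in candidates]
        refined[i] = [c for _, c in heapq.nsmallest(k, dists)]
    return refined

# Toy 1-D points with a deliberately poor initial graph.
points = np.array([[0.0], [1.0], [2.0], [10.0]])
approx = {0: [3], 1: [0], 2: [1], 3: [2]}
print(refine_knn(points, approx, 2))
```

Even from a poor starting graph, one round already replaces point 0's spurious neighbor (point 3) with the genuinely closer point 2; LargeVis iterates this idea until the graph is accurate.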
In the original high-dimensional space, LargeVis treats each point in turn as a center point: the center point and its neighbor nodes constitute positive samples, and the center point and non-neighbor points constitute negative samples. The weight of positive samples is defined as in t-SNE. The conditional probability from x_i to x_j is first calculated as

p(j|i) = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{(i,k)∈E} exp(−‖x_i − x_k‖² / 2σ_i²),

where σ_i is chosen by setting the perplexity of the conditional distribution p(·|i) equal to a preset perplexity u. The graph is then symmetrized by setting the weight between x_i and x_j to

w_ij = (p(j|i) + p(i|j)) / 2N.

In the low-dimensional space, the coordinate positions are determined by the probability of observing edges. The probability of observing a binary edge e_ij = 1 between a pair of vertices is defined as

P(e_ij = 1) = f(‖y_i − y_j‖),

where y_i and y_j are the embeddings of the pair of vertices in the low-dimensional space and f(·) is a probabilistic function of the distance ‖y_i − y_j‖. When two vertices are close in the low-dimensional space, there is a high probability of observing a binary edge between them. Extending to general weighted edges, the probability of observing a weighted edge e_ij = w_ij is defined as

P(e_ij = w_ij) = P(e_ij = 1)^{w_ij}.

According to the above definitions, given a weighted graph G = (V, E), the likelihood of the graph can be calculated as

O = Σ_{(i,j)∈E} w_ij log P(e_ij = 1) + Σ_{(i,j)∈Ē} γ log(1 − P(e_ij = 1)),

where Ē is the set of unobserved vertex pairs and γ is a unified weight assigned to the negative edges. The first term is the likelihood of the observed edges; maximizing it keeps similar data points close together in the low-dimensional space. The second term is the likelihood of vertex pairs without edges; maximizing it pushes data from different classes further apart. Maximizing the whole objective achieves both goals.

Optimizing this objective directly incurs a large computational overhead because the number of negative edges is very large, and training on all of them would raise the complexity further. The LargeVis algorithm therefore uses negative sampling for optimization: for each vertex i, some vertices j are sampled randomly according to a noise distribution P_n(j) and the pairs (i, j) are treated as negative edges, where the probability follows the noise distribution

P_n(j) ∝ d_j^{0.75},

in which d_j is the degree of vertex j, and M negative samples are drawn for each positive edge. The objective function can then be redefined as

O = Σ_{(i,j)∈E} w_ij [ log P(e_ij = 1) + Σ_{k=1}^{M} E_{j_k ∼ P_n(j)} γ log(1 − P(e_{i j_k} = 1)) ].

After applying negative sampling and edge sampling, LargeVis also uses asynchronous stochastic gradient descent for training, which effectively reduces the computational complexity of the algorithm.
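The negative sampling step can be sketched as follows. The 0.75 exponent of the noise distribution follows the LargeVis paper; the degrees and sample counts below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def negative_sampling_dist(degrees, power=0.75):
    # Noise distribution P_n(j) proportional to d_j ** 0.75.
    p = np.asarray(degrees, dtype=float) ** power
    return p / p.sum()

def sample_negatives(degrees, i, m):
    # Draw m candidate negatives for center vertex i; drop i itself if drawn.
    p = negative_sampling_dist(degrees)
    draws = rng.choice(len(degrees), size=m, p=p)
    return [int(j) for j in draws if j != i]

# A vertex of degree 16 is sampled 8x more often than one of degree 1.
print(negative_sampling_dist([1, 16]))
```

Sampling a handful of negatives per positive edge replaces the sum over all unobserved pairs, which is what makes the objective tractable on large graphs.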

Extended LargeVis Method
As mentioned above, the LargeVis algorithm is designed for the visualization of large-scale high-dimensional data. Because it relies on the pairwise distances among all points, LargeVis cannot reduce the dimensionality of the high-dimensional data of a single image. Hence, we extend LargeVis with SVR to fit the implicit mapping function, enabling dimensionality reduction for the high-dimensional data of a single image. Support vector regression (SVR) is an effective regression method based on the support vector machine that handles nonlinear problems well. Its goal is to find an optimal hyperplane and control the error between all training samples and this hyperplane, as shown in Figure 2, in which ε is the fitting accuracy control parameter.

With SVR, any value falling inside the ε-tube (the dotted lines in Figure 2) is regarded as correctly predicted, and only the loss of values falling outside the tube is counted. The optimization problem of SVR is obtained by introducing the slack variables ξ_i ≥ 0 and ξ_i* ≥ 0:

min (1/2)‖w‖² + C Σ_{i=1}^{n} (ξ_i + ξ_i*)
s.t. y_i − (w·x_i + b) ≤ ε + ξ_i,
     (w·x_i + b) − y_i ≤ ε + ξ_i*,
     ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, 2, …, n.

Introducing Lagrange multipliers yields the linear fitting function

f(x) = Σ_{i=1}^{n} (α_i − α_i*)(x_i · x) + b,

where α_i and α_i* are the Lagrange multipliers. Adding a kernel function K(·, ·) gives

f(x) = Σ_{i=1}^{n} (α_i − α_i*) K(x_i, x) + b.

Compared with other fitting methods, SVR produces the best fitting accuracy, so we use SVR as the fitting method. As shown in Figure 3, fitting is divided into a training phase and a dimensionality reduction phase, where the training phase is performed offline. The specific process is as follows:

Training phase:
1. Low-dimensional data for the whole training set are obtained by LargeVis; the high-dimensional data are reduced to different dimensionalities to meet different requirements. Sample pairs of training data are composed of the high-dimensional data (N dimensions) and the corresponding low-dimensional data (M dimensions), with each component of the low-dimensional vector, from the first through the Mth dimension, serving as a regression target.
2. The sample pairs are input into the SVR to build a mapping model from high-dimensional data to low-dimensional data. In particular, each component of the low-dimensional data needs its own SVR fitting model.

Dimensionality reduction phase:
The appropriate SVR fitting model is selected according to the dimensionality reduction demands. The final dimensionality reduction data can be obtained by combining all fitting results in order, as shown in Figure 4.
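The two phases can be sketched with scikit-learn's SVR, one model per low-dimensional component. The data below are synthetic stand-ins for the fully connected features and the LargeVis output, and the dimensionalities are illustrative, not those used in the experiments.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Stand-ins: 200 "FC features" (N = 32) and their "LargeVis embeddings" (M = 4).
X_high = rng.normal(size=(200, 32))
Y_low = X_high[:, :4] @ rng.normal(size=(4, 4))

# Training phase: one SVR fitting model per low-dimensional component.
models = [SVR(kernel="rbf").fit(X_high, Y_low[:, d]) for d in range(Y_low.shape[1])]

# Dimensionality reduction phase: map a single new high-dimensional vector,
# combining the per-component fitting results in order.
x_new = rng.normal(size=(1, 32))
y_low = np.array([mdl.predict(x_new)[0] for mdl in models])
print(y_low.shape)  # (4,)
```

Because the fitted models replace the graph-based LargeVis optimization at query time, a single unseen feature vector can be reduced deterministically, which addresses both limitations of LargeVis noted in the introduction.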

High-Resolution Remote Sensing Image Retrieval Method
In this section, we apply the proposed E-LargeVis method to high-resolution remote sensing image retrieval. The overall process is shown in Figure 5, in which the fully connected layer features are extracted by using a channel attention-based ResNet50 as a backbone network. Then, E-LargeVis is used to reduce the dimensionality of the features to obtain a low-dimensional discriminative representation. Finally, L2 distance is computed for similarity measurement to realize the retrieval of high-resolution remote sensing images.
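The final retrieval step, ranking database features by L2 distance to the query, amounts to a nearest-neighbor search; the feature vectors below are toy 2-D stand-ins for the reduced deep features.

```python
import numpy as np

def retrieve(query_feat, db_feats, top_k=5):
    # Rank database images by L2 distance to the query feature, nearest first.
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]

# Toy low-dimensional features for three database images.
db = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
idx, d = retrieve(np.array([0.9, 0.1]), db, top_k=2)
print(idx)  # [1 0]
```

Keeping the features low-dimensional via E-LargeVis is what keeps this distance computation cheap across a large archive.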

Figure 5. Channel attention-based high-resolution RSIR. First, the fully connected layer features are extracted by SENet-ResNet50. Then, E-LargeVis is used to reduce the dimensionality of the features to obtain a low-dimensional discriminative representation. Finally, the L2 distance is computed to measure similarity and realize the retrieval of high-resolution remote sensing images.

Channel Attention-Based ResNet50
In this section, we firstly make a detailed introduction to ResNet network architecture and then review the channel attention mechanism.

ResNet50 Network
The ResNet network was proposed by He et al. [27] in 2015. Its main contribution is to solve the problem that classification accuracy degrades as a CNN deepens. In addition, the proposed residual learning idea accelerates CNN training and effectively mitigates the vanishing and exploding gradient problems.
Driven by the idea of residual learning, He et al. proposed a shortcut connection structure of identity mapping, as shown in Figure 6, where x is the input, H(x) is the desired underlying mapping, F(x) is the residual mapping, and H(x) = F(x) + x. By transforming the network from fitting the desired underlying mapping H(x) into fitting the residual mapping F(x), the output becomes a composition of the input and the residual map, making the network more sensitive to the change between the input x and the output H(x).
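The identity shortcut H(x) = F(x) + x can be sketched with toy linear layers standing in for the convolutional residual branch F(x):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): two toy linear layers standing in for the conv branch; H(x) = F(x) + x.
    f = relu(x @ w1) @ w2
    return relu(f + x)

x = np.ones((1, 4))
w1 = np.zeros((4, 4))
w2 = np.zeros((4, 4))
# With zero weights the residual branch F(x) is 0, so the block passes x through.
print(residual_block(x, w1, w2))  # [[1. 1. 1. 1.]]
```

The zero-weight case makes the key property visible: when the residual branch contributes nothing, the block defaults to the identity, which is why stacking many such blocks does not degrade the signal.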

To build deeper networks, He et al. [27] also constructed the bottleneck structure, shown in Figure 7, which adds a 1 × 1 convolution to reduce the input dimensionality. The bottleneck structure is used in the ResNet-50/101/152 networks. ResNet has since been widely applied in various computer vision tasks and has achieved outstanding performance. In this paper, ResNet50 was selected as the backbone network to extract the fully connected layer features of the image for image retrieval.
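The saving from the bottleneck design can be checked with a quick weight count, here for a 256-channel input with a 4x channel reduction (the channel sizes follow the ResNet paper's example; biases and batch normalization are ignored):

```python
def conv_params(k, c_in, c_out):
    # Weight count of a k x k convolution layer (biases and batch norm ignored).
    return k * k * c_in * c_out

# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256).
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64) + conv_params(1, 64, 256)
# Plain alternative: two 3x3 convolutions at 256 channels.
plain = 2 * conv_params(3, 256, 256)
print(bottleneck, plain)  # 69632 1179648
```

The 1 × 1 convolutions shrink the channel count before the expensive 3 × 3 convolution, cutting the weight count by roughly a factor of 17 for this configuration.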

Channel Attention Mechanism
In the general structure of a CNN, the convolutional layer does not take into account the dependence between the output results and the channel features. The basic idea of the attention mechanism is to allow the network to selectively enhance the features that contribute more to the task while suppressing unimportant features; the channel attention mechanism is one of the most commonly used attention mechanisms. The squeeze-and-excitation network (SENet) proposed by Hu et al. [34] in 2017 is one of the representative works. The SENet block is shown in Figure 8.

Figure 8. The SENet block [34]. The size of the original feature map is H × W × C, where H is the height, W the width, and C the number of channels. Fsq(·) compresses the feature map from H × W × C to 1 × 1 × C; then Fex(·, W) learns the dependence of each channel, and Fscale(·, ·) adjusts the feature map according to the learned dependence.
The channel attention mechanism is divided into three parts: squeeze, excitation, and scale. First, the squeeze step encodes each channel of the feature map into a global feature, implemented by global average pooling:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j),

where u_c is the cth channel of the feature map produced by the convolutional layer and H and W are its spatial dimensions.
After the squeeze operation, a global feature description is obtained. The global description is then passed through an excitation function to learn the nonlinear relationships between channels. A sigmoid-based gating mechanism is adopted:

s = Fex(z, W) = σ(W_2 δ(W_1 z))

where δ is the ReLU function, σ is the sigmoid function, W_1 ∈ R^{(C/r)×C}, and W_2 ∈ R^{C×(C/r)}. Two fully connected layers are used: the first reduces the dimensionality from C to C/r (r is a reduction-ratio hyper-parameter), which decreases the computational complexity of the model; the second restores the dimensionality to C, aligning with the original number of channels. Finally, each activation value is multiplied with the corresponding original feature channel:

x̃_c = Fscale(u_c, s_c) = s_c · u_c

Except for the hyper-parameter r, all parameters in this process are learned during training. The hyper-parameter r allows the capacity and computational cost of the block to be varied. Hu et al. [34] conducted experiments for a range of r values; setting r = 16 achieves a good balance between accuracy and complexity. During learning, the feature channels that are more useful for the task are assigned higher weights, which means that the representation ability of these "important features" is enhanced. This structure can be integrated into many existing networks, such as Inception and ResNet.
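The three operations above can be sketched in a few lines of NumPy. This is a minimal illustration only: the weights W1 and W2 are random stand-ins for the learned fully connected layers, and `se_block` is a hypothetical helper name, not the paper's code.

```python
import numpy as np

def se_block(feature_map, r=16):
    """Squeeze-and-excitation sketch (illustrative, not the paper's code).
    `feature_map` has shape (H, W, C); W1 and W2 are random stand-ins
    for the learned fully connected layers."""
    H, W, C = feature_map.shape
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((C // r, C)) * 0.1   # C -> C/r reduction
    W2 = rng.standard_normal((C, C // r)) * 0.1   # C/r -> C restoration
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_map.mean(axis=(0, 1))
    # Excitation: sigmoid(W2 · ReLU(W1 · z)) gives per-channel weights in (0, 1)
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))
    # Scale: reweight each channel of the original feature map
    return feature_map * s

x = np.ones((7, 7, 32))
out = se_block(x, r=16)
print(out.shape)  # (7, 7, 32)
```

Because the excitation output passes through a sigmoid, every channel is scaled by a factor strictly between 0 and 1, which is what lets the block suppress uninformative channels.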
The channel attention structure is easy to implement. We integrated the channel attention structure into the ResNet50 network as a backbone network to extract features, named the SENet-ResNet50 network. The SENet-ResNet50 network residual module is shown in Figure 9.
In the embedding mode, the squeeze and excitation operations are performed on the ResNet residual module, and then the processed residual module is added to the identity map x as the output of this layer.
In this paper, the fully connected layer features were first extracted from SENet-ResNet50, and then E-LargeVis was used to reduce the dimensionality of the features to avoid the "curse of dimensionality". Finally, the obtained low-dimensional features were used for similarity measurement.
Figure 9. Channel attention-based ResNet residual module [34]. The global pooling layer and fully connected layers were added to the residual module to calculate the excitation parameters, and the channel-attention features were obtained by multiplying the original features with the excitation parameters.

Similarity Measurement
The Euclidean distance between feature vectors, defined in Euclidean space, is adopted to measure the similarity of images. For two points x_1 = (x_11, x_12, . . . , x_1n) and x_2 = (x_21, x_22, . . . , x_2n) in n-dimensional space, it is defined as follows:

d(x_1, x_2) = √( Σ_{k=1}^{n} (x_1k − x_2k)² )
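In code, this distance is a one-liner; `l2_distance` below is an illustrative helper name:

```python
import numpy as np

def l2_distance(x1, x2):
    """Euclidean (L2) distance between two n-dimensional feature vectors."""
    return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

print(l2_distance([0.0, 3.0], [4.0, 0.0]))  # 5.0
```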

Experimental Results and Analysis
To evaluate the performance of the proposed E-LargeVis dimensionality reduction method and the proposed high-resolution RSIR method based on a channel attention ResNet50, we conducted comparisons on four high-resolution remote sensing image datasets: UCM, RS19, RSSCN7, and AID.

Datasets and Evaluation Metric
Changes on the Earth's surface usually develop over long periods and require high-resolution imagery in order to be properly recognized and retrieved [35]. UCM, WHU-RS, RSSCN7 and AID are currently the four most commonly used high-resolution remote sensing image datasets.
The images in the UCM dataset [36] come from the United States Geological Survey's city map, which contains a total of 21 categories including airplanes, beaches, buildings, and dense residential areas. Each category contains 100 images of 256 × 256 size, and the spatial resolution of each pixel is 0.3 m.
The WHU-RS19 dataset [37] is a remote sensing image dataset released by Wuhan University in 2011. The image size is 600 × 600, and it contains 19 types of scene images. Each type contains about 50 images, for a total of 1005 images.
The RSSCN7 dataset [38] is a remote sensing image dataset released by Wuhan University in 2015 and contains a total of 2800 images. These images come from seven typical scenes including grassland, forest, farmland, parking lot, residential area, industrial area and lake. Each category includes 400 images, which were collected from Google Maps during different seasons and weather changes, corresponding to four scales of sampling, each scale with 100 images with a size of 400 × 400. Due to the variety of scenarios, this dataset presents some challenges.
The AID dataset [39] is a remote sensing image dataset jointly released by Wuhan University and Huazhong University of Science and Technology in 2017. It contains 30 types of scenes, with 220-420 images per type and 10,000 images in total, each sized 600 × 600.
We used mean average precision (mAP) [40] to evaluate retrieval performance, which is the accepted image retrieval performance evaluation index. The mAP is the mean of the average precisions over a set of Q queries, and it measures the average retrieval precision across all the query images:

mAP = (1/Q) Σ_{q=1}^{Q} AveP(q)

The definition of AveP is:

AveP = ( Σ_{k=1}^{n} P(k) · rel(k) ) / R

where n is the number of retrieved images, R is the number of relevant images, P(k) is the precision over the top-k results, and rel(k) is an indicator function whose value is 1 when the kth image is a relevant image and 0 otherwise.
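The definition above translates directly into code. The following is a small sketch with illustrative helper names (`average_precision`, `mean_average_precision` are not from the paper):

```python
import numpy as np

def average_precision(rel):
    """AveP for one query: rel[k] is 1 if the k-th ranked image is relevant."""
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # P(k): precision over the top-k results, accumulated at relevant ranks.
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(rel_lists):
    """mAP: mean of AveP over all query images."""
    return float(np.mean([average_precision(r) for r in rel_lists]))

# Relevant results at ranks 1 and 3: AveP = (1/1 + 2/3) / 2 ≈ 0.8333
print(average_precision([1, 0, 1, 0]))
print(mean_average_precision([[1, 1, 0], [0, 1, 0]]))  # (1.0 + 0.5) / 2 = 0.75
```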

Experimental Setting
In our experiments, the dataset was randomly divided and the experiments were repeated five times, with the final result being the average. In each run, 80% of the images were randomly selected from each dataset as training samples, and the remaining 20% were used as test samples. The training set was expanded by rotating the original image and its horizontal mirror image every 45°. The expanded dataset was 16 times the original size (8 rotations × 2 mirror states); this was used to train the CNN network model.
Our network was built and tested in the Keras open-source framework. The experimental platform uses Intel Core i7-8700 (Intel Corporation, Santa Clara, CA, USA), CPU 3.2 GHz, 32 GB memory, and contains an NVIDIA GeForce RTX 2080 Ti graphics card (NVIDIA Corporation, Santa Clara, CA, USA) for training and testing. The number of training iterations was set to 50 rounds, the batch size was set to 16, and the learning rate was 0.01. Momentum and weight decay methods were used to optimize the training process to prevent overfitting. The weight decay rate was 0.0001, and the momentum parameter was set to 0.9.

Experiment I: Performance Comparison of Different CNN
To verify the robustness and representation ability of the deep features extracted by different CNN architectures, we used the AlexNet [11], VGG-16 [18], GoogLeNet [19], ResNet50 [26] and SENet-ResNet50 networks to extract fully connected layer features. For each CNN, the ImageNet dataset [26] was used for pre-training to obtain the initial parameters of the network model, and then the high-resolution remote sensing image dataset was used to fine-tune these parameters. The fully connected layer features of the network were extracted as deep features for image retrieval. Table 1 shows the dimensionality of the fully connected layer features extracted by the five CNN architectures.

Table 2 shows the comparison results of mAP with the different CNN networks on the four datasets. It can be seen from Table 2 that our SENet-ResNet50 achieves a significant improvement in retrieval performance compared with the other four network architectures on the different datasets. The mAP of SENet-ResNet50 was 96.64%, 97.69%, 85.10% and 89.03% on the four datasets, respectively, and the largest improvements compared to the other methods were as follows: 33.75% higher than VGG-16 on the UCM dataset, 27.26% higher than GoogLeNet on the WHU-RS dataset, 10.88% higher than ResNet50 on the RSSCN7 dataset, and 25.52% higher than GoogLeNet on the AID dataset. This shows that the deep features extracted by the SENet-ResNet50 architecture have a stronger representative and discriminative ability. In addition, compared with ResNet50, the mAP of SENet-ResNet50 increased from 93.76% to 97.69% on the WHU-RS dataset, and from 74.22% to 85.10% on the RSSCN7 dataset, which shows that the channel attention mechanism can further improve the representation ability of ResNet50 deep features and thereby the retrieval performance.

Experiment II: Performance Comparison of SVR Regression Method
To verify the performance of SVR, we performed regression on the LargeVis results with SVR, Ridge Regression and Lasso. The SVR kernel was set to "rbf", the degree of the polynomial kernel function was set to 3, and the gamma parameter was set to "auto". The regularization parameters of Ridge Regression and Lasso were chosen as the optimal result from a search over 10⁻⁵ to 10² with grid lengths of 10, 20 and 30. The retrieval comparison results of SVR and the other regression methods are shown in Table 3. It can be seen from Table 3 that SVR obtains better performance on all four datasets and dimensionalities. Ridge Regression and Lasso are classic regression methods widely used for data regression; in this experiment, SVR is at least 0.58% higher than the other methods in retrieval mAP. Since the results come from regressing the LargeVis output, the retrieval performance is affected by the performance of the CNN, LargeVis and the regression method. Therefore, SVR is chosen as the regression method in this paper.
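The core idea of E-LargeVis, fitting a regressor from high-dimensional features to their LargeVis embeddings so that a single unseen sample can be projected, can be sketched with scikit-learn as follows. This is a minimal illustration under assumed names: `X_high` and `Y_low` are random stand-in data, not real CNN features or LargeVis output, and the SVR settings mirror those stated above.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Toy stand-in data: pretend X_high are fully connected layer features
# and Y_low their LargeVis low-dimensional embeddings.
rng = np.random.default_rng(42)
X_high = rng.standard_normal((200, 32))                        # 200 images, 32-D
Y_low = X_high[:, :2] + 0.01 * rng.standard_normal((200, 2))   # 2-D "embedding"

# One RBF-kernel SVR per output dimension, matching the settings above
# (kernel="rbf", degree=3, gamma="auto"); MultiOutputRegressor handles
# the multiple low-dimensional outputs.
model = MultiOutputRegressor(SVR(kernel="rbf", degree=3, gamma="auto"))
model.fit(X_high, Y_low)

# A single unseen high-dimensional vector can now be projected directly,
# which plain LargeVis cannot do.
x_new = rng.standard_normal((1, 32))
print(model.predict(x_new).shape)  # (1, 2)
```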

Experiment III: Performance Comparison of E-LargeVis Dimensionality Reduction
To verify the performance of the proposed E-LargeVis dimensionality reduction method, we performed dimensionality reduction on the deep features extracted by the five CNNs. Figure 10 shows the comparison results of mAP at different dimensionalities using the E-LargeVis dimensionality reduction method.
It can be seen from the experimental results that only the retrieval performance for the WHU-RS dataset was lower after using E-LargeVis, in which the maximum reduction was from 83.04% to 80.81% when compared with the VGG-16 network. In all other cases, using E-LargeVis improved the retrieval performance, with the largest increase from 74.22% to 91.45% (ResNet50 network with the RSSCN7 dataset). LargeVis can increase the distance of clusters that are far apart in the high-dimensional space after dimensionality reduction, so the low-dimensional features have a stronger discriminative ability for improving retrieval performance.
Through a comprehensive comparison of the retrieval performance at the various dimensionalities, the proposed retrieval scheme reaches optimal performance at 64 dimensions while keeping the dimensionality as low as possible. Therefore, we set the output dimensionality of E-LargeVis to 64.


Experiment IV: Performance Comparison of Euclidean and Other Similarity Measurement Methods
To verify the performance of the Euclidean distance, we compared it with other classical similarity measurement methods: Cityblock, Chebychev, Cosine, Correlation and Spearman. The experimental comparison results are shown in Table 4. It can be seen from Table 4 that the Euclidean distance obtains better performance in most cases. The largest difference is for Correlation at 16 dimensions on the RSSCN7 dataset, which reaches 92.41%, 0.21% higher than the Euclidean distance. Considering all the results, the Euclidean distance is chosen as the similarity measurement method in this paper.
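For reference, the compared measures can each be written in a line of NumPy (Spearman omitted for brevity; `distances` is an illustrative helper name):

```python
import numpy as np

def distances(x, y):
    """The similarity measures compared above, in plain NumPy."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return {
        "euclidean": np.sqrt(np.sum((x - y) ** 2)),
        "cityblock": np.sum(np.abs(x - y)),
        "chebychev": np.max(np.abs(x - y)),
        "cosine": 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y)),
        "correlation": 1.0 - np.corrcoef(x, y)[0, 1],
    }

d = distances([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
print(round(d["euclidean"], 4))        # 2.2361 (= sqrt(5))
print(d["cityblock"], d["chebychev"])  # 3.0 2.0
```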

Experiment V: Performance Comparison of E-LargeVis and Other Dimensionality Reduction Methods
PCA is the most representative linear dimensionality reduction method, while LPP and LLE are two representative nonlinear dimensionality reduction methods. To verify the performance of the E-LargeVis method, we compared it with these classical dimensionality reduction methods, PCA [4], LPP [8], and LLE [7], with SENet-ResNet50 as the CNN architecture. The number of neighbors for LPP and LLE was chosen as the optimal value among 12, 32, 64, 128 and 256. The experimental comparison results of E-LargeVis and the other dimensionality reduction methods are shown in Table 5. It can be seen from Table 5 that the mAP of the LPP method was 98.92%, which was 0.04% higher than E-LargeVis at 16 dimensions on the WHU-RS dataset. In all other cases, E-LargeVis obtained much better retrieval performance. PCA is a linear method, so when dimensionality reduction is performed on nonlinear, high-dimensional deep features, it fails to obtain the best performance. Although LPP and LLE are nonlinear dimensionality reduction methods, their performance depends on the number of samples and on the parameters; if the optimal parameter configuration cannot be found, it is difficult to obtain good retrieval performance with these methods.
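The PCA baseline used for comparison can be sketched via the SVD in a few lines (an illustrative sketch, not the implementation used in the experiments; names are assumed):

```python
import numpy as np

def pca_reduce(X, d):
    """Linear PCA via SVD: project N samples onto the top-d principal components."""
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                          # (N, d) low-dimensional features

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 512))               # e.g. 512-D deep features
Z = pca_reduce(X, 64)
print(Z.shape)  # (100, 64)
```

Because PCA is a purely linear projection, it cannot capture the nonlinear structure of deep features, which is consistent with its weaker results in Table 5.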

Experiment VI: Image Retrieval Results
The retrieval results of remote sensing images are shown in Figure 11. From the retrieval results, our method obtained better retrieval results compared to other methods, and the similarity ranking also roughly conforms to the human visual system. The images with high similarity to the query image rank first in the retrieval results.


Experiment VII: Performance Comparison with the Existing Methods
To verify the effectiveness of our method, we compared it with nine existing advanced high-resolution RSIR methods. The retrieval performance comparison results on the four datasets are shown in Table 6, in which the experimental results of the other methods are taken from the literature.

As can be seen from Table 6, our method obtained the optimal retrieval performance on all four datasets. For the most challenging RSSCN7 dataset, our method achieved a mAP of 92.68%. Our method adopts the SENet-ResNet50 network, which obtains deep features with a discriminative representation. Using E-LargeVis to reduce the dimensionality of the deep features not only reduces the computational complexity and saves storage space, but also further enhances the discriminative ability of the deep features, yielding optimal retrieval performance.

Discussion
In this paper, an extended LargeVis (E-LargeVis) dimensionality reduction method for high-resolution RSIR was proposed, which realizes dimensionality reduction of the high-dimensional data of a single image by modeling the implicit mapping relationship between LargeVis high-dimensional and low-dimensional data with SVR. We then proposed a high-resolution RSIR method using channel attention and E-LargeVis. First, the fully connected layer features were extracted using a channel attention-based ResNet50 as the backbone network. Then, E-LargeVis was used to reduce the dimensionality of the features to obtain a low-dimensional discriminative representation. Finally, the L2 distance was computed for similarity measurement to realize the retrieval of high-resolution remote sensing images.
Seven experiments were conducted to verify the effectiveness of our method. In Experiment I, the AlexNet, VGG-16, GoogLeNet, ResNet50 and SENet-ResNet50 networks were used to extract FC layer features. For each CNN, the ImageNet dataset was used for pre-training to obtain the initial parameters of the network model, and then the high-resolution remote sensing image dataset was used to fine-tune these parameters. The number of training iterations was set to 50 rounds, the batch size was set to 16, and the learning rate was 0.01. Momentum and weight decay were used to optimize the training process and prevent overfitting; the weight decay rate was 0.0001 and the momentum parameter was 0.9. It can be seen from Table 2 that SENet-ResNet50 achieves a significant improvement in retrieval performance compared with the other four networks on all four datasets. In Experiment II, we performed regression on the LargeVis results with SVR, Ridge Regression and Lasso. The SVR kernel was set to "rbf", the degree of the polynomial kernel function was set to 3, and the gamma parameter was set to "auto". The regularization parameters of Ridge Regression and Lasso were chosen as the optimal result from a search over 10⁻⁵ to 10² with grid lengths of 10, 20 and 30. As can be seen in Table 3, SVR obtained better performance on all four datasets and dimensionalities, at least 0.58% higher than the other methods in retrieval mAP. In Experiment III, the E-LargeVis dimensionality reduction method was applied to the deep features extracted by the five CNNs, with the reduced dimensionality set to 16, 32, 64, 128 and 256. It can be seen from Figure 10 that the retrieval performance on the WHU-RS dataset dropped after using E-LargeVis, whereas it improved in all other cases.
LargeVis can increase the distance between clusters that are far apart in the high-dimensional space after dimensionality reduction, so the low-dimensional features have a stronger discriminative ability, improving retrieval performance. In Experiment IV, we verified the performance of the Euclidean distance. Compared with other classical similarity measurement methods such as Cityblock, Chebychev, Cosine, Correlation and Spearman, the Euclidean distance obtained the best performance in most cases. In Experiment V, PCA, LPP, LLE and E-LargeVis were used to reduce the dimensionality of the deep features of SENet-ResNet50. The number of neighbors for LPP and LLE was chosen as the optimal value among 12, 32, 64, 128 and 256. It can be seen from Table 5 that the mAP of the LPP method was 98.92%, which was 0.04% higher than E-LargeVis at 16 dimensions on the WHU-RS dataset; however, E-LargeVis obtained better retrieval performance in all other cases. PCA is a linear dimensionality reduction method, so when dimensionality reduction is performed on nonlinear, high-dimensional deep features, it fails to reach the best performance. Although LPP and LLE are nonlinear dimensionality reduction methods, their performance depends on the number of samples and on the parameters; if the optimal parameter configuration cannot be found, it is difficult to obtain good retrieval performance. In Experiment VI, the top five images returned by our retrieval method were shown. Our method obtained the best retrieval results, and the similarity ranking roughly conforms to the human visual system: the images with the highest similarity to the query image rank first in our retrieval results. In Experiment VII, we compared our method with nine existing advanced high-resolution RSIR methods. As can be seen from Table 6, our method obtained the optimal retrieval performance on all four datasets.
This experiment shows that using E-LargeVis to reduce the dimensionality of deep features not only reduces computational complexity and saves storage space, but also enhances the discriminative ability of deep features to obtain the optimal retrieval performance.

Conclusions
In this paper, an extended LargeVis (E-LargeVis) dimensionality reduction method was proposed for high-resolution RSIR. E-LargeVis uses SVR to fit the implicit high-dimensional-to-low-dimensional mapping produced by LargeVis, addressing the deficiency of the original LargeVis method, namely its inability to reduce the dimensionality of the high-dimensional data of a single image. A high-resolution RSIR method was then proposed that uses E-LargeVis to reduce the dimensionality of the fully connected layer features extracted with SENet-ResNet50. We evaluated the proposed E-LargeVis dimensionality reduction method and other retrieval methods on four high-resolution remote sensing image datasets. The experimental results showed that the E-LargeVis method can greatly improve retrieval performance. In future work, we will try other fitting methods to further improve the performance of E-LargeVis, and apply it to other fields such as fine-grained image classification to further verify its effectiveness.