Deep Unsupervised Embedding for Remote Sensing Image Retrieval Using Textual Cues

Abstract: Compared to image-image retrieval, text-image retrieval has been less investigated in the remote sensing community, possibly because of the complexity of appropriately tying textual data to respective visual representations. Moreover, a single image may be described via multiple sentences according to the perception of the human labeler and the structure of the language they use, which magnifies the complexity even further. In this paper, we propose an unsupervised method for text-image retrieval in remote sensing imagery. In the method, image representation is obtained via visual Big Transfer (BiT) models, while textual descriptions are encoded via a bidirectional Long Short-Term Memory (Bi-LSTM) network. The training of the proposed retrieval architecture is optimized using an unsupervised embedding loss, which aims to make the features of an image closest to those of its corresponding textual description and different from other image features, and vice versa. To demonstrate the performance of the proposed architecture, experiments are performed on two datasets, obtaining plausible text-image retrieval outcomes.


Introduction
Remote sensing refers to remotely acquiring, interpreting, and often geolocating information about the changes and phenomena that the Earth's surface undergoes. It is made possible by observation platforms such as satellites and aerial systems. The growing significance of remote sensing is reflected in the rapid rise in the amount of data acquired for civilian and military applications.
The potential of remote sensing technology has been exploited in a variety of applications, including environmental assessment and monitoring, precision agriculture, renewable natural resources, military surveillance, meteorology, mapping, and reconnaissance [1]. Information in these applications is acquired via sensors (which can be either passive or active) mounted on large satellites, medium-sized aerial vehicles, or even miniaturized drones [2].
The availability of remote sensing data, especially high-resolution images, has stimulated research in the remote sensing community. Typically, the primary research focus is on image classification and image retrieval [3][4][5]. In the computer vision literature, cross-modal retrieval has commonly been addressed by coupling a CNN for encoding visual semantic concepts with an LSTM for representing the sentence. Niu et al. [29] adopted a tree-structured LSTM to learn the hierarchical relations between images and sentences, in addition to learning the relation between phrases and visual objects. Zhang et al. [30] developed a cross-modal projection matching/classification approach using a CNN to encode visual image features and an LSTM to extract text features.

As is clear from the computer vision literature, the core problem in image retrieval is, given a query, to retrieve the most similar items in the database. This query can be an image, a textual description, or a combination of the two. Compared to image-image retrieval, text-image retrieval has been less investigated in the remote sensing community, possibly due to the complexity of appropriately tying textual data to respective visual representations. To cope with these limitations, the authors in [31] proposed a new dataset for text-to-image matching named TextRS. They used a Deep Bidirectional Triplet Network (DBTN) for matching text to images based on CNN and LSTM networks. In this work, we propose an alternative approach based on an asymmetric Siamese network. The first branch of this network uses BiT models for image representation, while the second branch relies on a bidirectional Long Short-Term Memory (Bi-LSTM) network for text encoding. The image and text representations are normalized and projected into a low-dimensional space. The embedding features of each image-text pair should be invariant, while the features of different image and text instances should spread out.
The experimental results obtained on the TextRS and Merced datasets are reported and discussed.
The paper is structured as follows. Section 2 introduces the proposed methodology, and Section 3 presents the experimental results. Finally, Section 4 draws conclusions and outlines future developments of the proposed work.

Proposed Methodology
The proposed architecture addresses the task of text-to-image matching, which aims to retrieve, for a given query, the matching image or sentence from a training set D prepared offline.
Let us assume a training set D consisting of N images alongside their respective sentences. In the test phase, given a query sentence t_q, we aim to retrieve the most relevant image from the training set D (Figure 1a). In the image-to-text retrieval scenario, conversely, a query image is presented to the model to retrieve the most likely textual description (Figure 1b). Figure 2 gives the detailed architecture of the proposed method, which is divided into two branches for learning appropriate image and text embeddings, i.e., f(X_i) and g(Y_i). Further details are provided in the subsequent sections.
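Once both branches are trained, retrieval at test time reduces to a nearest-neighbor search in the shared embedding space. The following minimal sketch (our own illustration, not the paper's code; the function name `retrieve_top_k` is hypothetical) shows this for L2-normalized embeddings, where cosine similarity is just a dot product:

```python
import numpy as np

def retrieve_top_k(query_emb, gallery_embs, k=5):
    """Rank gallery items by cosine similarity to an L2-normalized query.

    For L2-normalized embeddings, cosine similarity reduces to a dot
    product, so retrieval is a single matrix-vector multiply plus a sort.
    """
    sims = gallery_embs @ query_emb          # (N,) similarity scores
    return np.argsort(-sims)[:k]             # indices of the top-k matches

# Toy example: 4 gallery image embeddings, one text query (all unit-norm).
rng = np.random.default_rng(0)
gallery = rng.normal(size=(4, 8))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = gallery[2] + 0.05 * rng.normal(size=8)   # query close to item 2
query /= np.linalg.norm(query)
print(retrieve_top_k(query, gallery, k=2))       # item 2 ranked first
```

The same routine serves both directions: for text-to-image retrieval the gallery holds image embeddings f(X), and for image-to-text retrieval it holds sentence embeddings g(Y).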

Image Representation Using BiT Models
The backbone of the image-embedding module is based on BiT models (i.e., BiT-S, BiT-M, and BiT-L) [32]. These models are pre-trained on three upstream datasets of different scales: ImageNet-1k [33] (BiT-S), ImageNet-21k (BiT-M), and JFT-300M [34] (BiT-L). ImageNet-1k, designed for the ILSVRC image classification task, is composed of more than 1.28 M images in 1k classes. ImageNet-21k, also called the full ImageNet, is a larger-scale dataset that contains 14 M images and 21k classes. The JFT-300M dataset is a subsequent version of the dataset introduced in [35,36]; it has 300 M real-world images and 18k classes, with each image carrying approximately 1.26 labels on average, resulting in a total of 375 M labels. Note that these BiT models, the largest of which has 928 M parameters, yield state-of-the-art performance on several benchmark datasets for transfer learning.
BiT models adopt a standard ResNet-v2 [37] architecture at different sizes (i.e., a ResNet-50 (R50x1), a ResNet-50 that is three times wider (R50x3), a ResNet-101 (R101x1), a ResNet-101 that is three times wider (R101x3), and a ResNet-152 that is four times wider (R152x4)), with some updates. Unlike the standard ResNet architecture, BiT models replace batch normalization (BN) with two techniques, group normalization (GN) and weight standardization (WS), in all convolutional layers (Figure 3). Note that there are other normalization methods, such as layer normalization (LN) and instance normalization (IN), which can be considered extreme cases of GN. Figure 4 illustrates the various normalization techniques and the relations between BN, LN, IN, and GN. GN has proven effective in many applications, such as detection and segmentation [38] and video classification [39], which makes it a strong alternative to BN. GN divides the channels into groups and normalizes the features within each group according to the group's mean and variance. GN has been shown to be more stable than BN with respect to batch size, because the computation of batch statistics is inherently avoided: the batch dimension is not exploited at all, so a model can be transferred from pre-training to fine-tuning irrespective of batch-size changes.
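To make the difference from BN concrete, here is a minimal numpy sketch of GN for an NHWC feature map (our own illustration of the technique, not BiT's implementation; `group_norm` omits the learnable scale and shift for brevity):

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Group Normalization for an NHWC feature map (minimal sketch).

    Channels are split into `groups` groups; each group is normalized by
    its own mean and variance, computed per sample, so no batch statistics
    are involved (unlike BatchNorm).
    """
    n, h, w, c = x.shape
    x = x.reshape(n, h, w, groups, c // groups)
    mean = x.mean(axis=(1, 2, 4), keepdims=True)   # per (sample, group)
    var = x.var(axis=(1, 2, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(n, h, w, c)

# Each (sample, group) slice comes out with ~zero mean and unit variance,
# regardless of batch size -- even a batch of one works.
x = np.random.default_rng(1).normal(2.0, 3.0, size=(1, 4, 4, 8))
y = group_norm(x, groups=2)
print(np.allclose(y[..., :4].mean(), 0.0, atol=1e-6))  # True
```

The per-sample computation is exactly why GN behaves identically at pre-training and fine-tuning batch sizes, whereas BN's batch statistics shift.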
In this work, we use these BiT models for image representation. In particular, we feed the image as input to the network and extract the corresponding feature representation before the classification layer. These features are then projected into a low-dimensional feature space using a dense layer and normalized, as shown in Figure 2.

Text Representation Using Bi-LSTM
The sentence is fed through a word-embedding layer followed by a Bi-LSTM. Note that the Bi-LSTM [40] is an extension of the conventional LSTM, which is in turn a modified version of the RNN.
RNNs are based on the idea of connecting current and previous information, which enables them to model sequences of data [41]. RNNs try to remember information learned during the training procedure as well as what they learned from previous inputs. This is achieved by repeatedly applying transformations to the input sequence: after an output has been generated, it is copied and fed back into the recurrent network.
Although RNNs have been used in various tasks, they encounter major problems such as vanishing and exploding gradients. To cope with these limitations, the LSTM network [42] has been introduced as an alternative solution. The LSTM network depends on a so-called memory cell, which can learn (i.e., decide) how data are allowed to enter, leave, or be removed from the cell state through an iterative process. This is done at each time step through special structures called gates. The LSTM has three gates, i.e., the input, forget, and output gates. The input gate controls how the input data change the state of the memory cell, the forget gate controls whether the previous cell state is remembered or forgotten, and the output gate enables the memory cell to influence the outputs. The equations for the gates are as follows:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t)

where x_t is the input at time step t, h_t and c_t are the hidden and cell states, σ is the sigmoid function, ⊙ denotes element-wise multiplication, and the W, U, and b terms are learnable weights and biases. As mentioned previously, in this work we use a Bi-LSTM, which comprises two LSTMs. During training, one LSTM processes the input sequence while the other processes a reversed copy of it. In other words, the input sequence is processed in both the forward (past to future) and backward (future to past) directions by two recurrent networks with separate hidden layers, a forward state sequence and a backward state sequence. Both networks connect to the same output layer to generate the output. Similar to the image branch, the output of the Bi-LSTM is fed to a fully connected layer followed by ℓ2-normalization, yielding a feature representation g(Y_i) for the sentence Y_i.
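A single LSTM time step with its three gates can be sketched in a few lines of numpy (our own illustration of the standard LSTM recurrence, not the paper's TensorFlow implementation; the parameter layout in `W`, `U`, `b` is an assumption for compactness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step implementing the three gates described above.

    W, U, b hold the stacked parameters for the input (i), forget (f),
    and output (o) gates plus the candidate cell update (g), in that order.
    """
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b                # (4d,) pre-activations
    i = sigmoid(z[:d])                          # input gate
    f = sigmoid(z[d:2*d])                       # forget gate
    o = sigmoid(z[2*d:3*d])                     # output gate
    g = np.tanh(z[3*d:])                        # candidate cell state
    c_t = f * c_prev + i * g                    # new cell state
    h_t = o * np.tanh(c_t)                      # new hidden state
    return h_t, c_t

# A Bi-LSTM simply runs one such recurrence forward over the sequence and
# a second one over the reversed sequence, then combines the two outputs.
rng = np.random.default_rng(2)
d, k = 3, 5                                     # hidden size, input size
W = rng.normal(size=(4*d, k))
U = rng.normal(size=(4*d, d))
b = np.zeros(4*d)
h, c = np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(4, k)):             # 4-step toy sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)  # (3,)
```

Because the hidden state is gated through tanh, its entries always stay within (-1, 1), which keeps the recurrence numerically well behaved over long sequences.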

Optimization
Retrieval tasks are usually solved by learning a distance metric [43]. Inspired by the accomplishments of deep learning in computer vision [26], deep neural networks have been used to learn embeddings of discriminative features useful for learning these distances [44,45]. In this case, the embedded features of similar samples should be closer, while those of dissimilar samples should be farther apart. The computer vision literature offers several loss functions, such as the triplet [45], quadruplet [46], lifted structure [47], N-pair [48], and angular [49] losses. In this work, we extend the softmax embedding variant, originally proposed for image-to-image retrieval, to the case of text-image retrieval. Because of memory requirements, we learn these distances iteratively on small batches sampled from the complete dataset. The aim is to make the features of an image closest to those of its corresponding textual description and different from other image features, and vice versa.
If we consider B_k as the k-th mini-batch of m instances sampled from the full dataset D, then for a sentence Y_i, its corresponding image X_i should be classified into instance i, while the other images X_j, with j ≠ i, should not be classified into instance i. The probability of the image X_i being recognized as instance i is defined by:

P(i|X_i) = exp(f(X_i)^T g(Y_i)/τ) / Σ_{k=1}^{m} exp(f(X_k)^T g(Y_i)/τ)

where τ is a temperature parameter controlling the sharpness of the distribution. Similarly, the probability of a sentence Y_j not being assigned to instance i can be defined by 1 − P(i|Y_j), where:

P(i|Y_j) = exp(f(X_i)^T g(Y_j)/τ) / Σ_{k=1}^{m} exp(f(X_k)^T g(Y_j)/τ)

Then, under the assumption that the events of different sentences being recognized as instance i are independent, the joint probability of the image X_i being recognized as instance i and of no sentence Y_j, j ≠ i, being assigned to instance i is simply P(i|X_i) Π_{j≠i} (1 − P(i|Y_j)).
The corresponding negative log-likelihood for the mini-batch B_k can then be given as follows:

J_t(B_k) = − Σ_i log P(i|X_i) − Σ_i Σ_{j≠i} log(1 − P(i|Y_j))

The corresponding total log-likelihood J_t over the entire dataset is obtained by summing J_t(B_k) over all mini-batches. Similarly, we can extend this formulation in the reverse direction, from image X_i to sentence Y_i, by exchanging the roles of f(X) and g(Y), which yields an analogous log-likelihood J_v. The total log-likelihood used as the loss for learning the parameters of the proposed asymmetric Siamese network is then given by:

J = λ_1 J_t + λ_2 J_v

where λ_1 and λ_2 are balancing weights.

Dataset Description
In this work, we use two different benchmark datasets to validate the performance of the proposed method. The first one is the TextRS dataset [50], which consists of 2144 images collected from several scene datasets (i.e., AID [31], Merced [51], PatternNet [52], and NWPU [53]) (see Figure 5). In particular, this dataset was built by randomly selecting 16 images from each class of four popular heterogeneous scene datasets: AID (30 classes), UC Merced (21 classes), PatternNet (38 classes), and NWPU (45 classes). TextRS thus contains 2144 images: 720 images with a spatial resolution of 0.2 to 30 m, 608 images with a spatial resolution of 0.062 to 4.7 m, 480 images with a spatial resolution of 0.5 to 8 m, and 336 images with a spatial resolution of 30 m, as shown in Table 1. Each remote sensing image is annotated with five different sentences; therefore, the total number of sentences is 10,720. The second dataset is the Merced Land-Use dataset, which consists of remote sensing images from 21 classes [54]. Each class has 100 RGB images (256 × 256 pixels). The total number of images is 2100, with every image also labeled with five different sentences (see Figure 5b).

Experimental Setup
We implemented the proposed method using TensorFlow-Keras. We divided the TextRS dataset according to [32], while we randomly split the Merced dataset into 80% for training and 20% for testing. We use a mini-batch size of 50 images for training the network. For optimization, we use the Adam optimizer with its default parameters. In addition, we set the regularization parameters λ_1 and λ_2, which control the contributions of the two losses, to 1. We train the models for 50 iterations.

For performance evaluation of the proposed method, we use the Recall@K (R@K) metric, which is widely used to score the match between an image and a query sentence and vice versa. In the subsequent sections, we present the results in terms of R@K for K = 1, 5, and 10, calculated as follows:

R@K = (true positives@K) / ((true positives@K) + (false negatives@K))

All experiments are conducted on a workstation with an Intel Core i9 processor running at 3.6 GHz, 64 GB of memory, and a GPU with 11 GB of GDDR5X memory.

Table 2 shows the results obtained with our model using m-R50x1 as the pre-trained CNN for the image branch. In the case of text-to-image retrieval, the scores (R@1, R@5, and R@10) are 19.02%, 55.25%, and 71.72%, respectively, on the TextRS dataset, and 21.86%, 60.00%, and 75.58%, respectively, on the Merced dataset. On the other hand, the image-to-text matching scores are 22.95%, 59.52%, and 77.23%, respectively, on the TextRS dataset, and 25.47%, 59.76%, and 72.61%, respectively, on the Merced dataset. We observe that the retrieval accuracy increases significantly when going from R@1 to R@10, which indicates the difficulty of obtaining the exact match for the first retrieved result. In Figure 6, we show two successful scenarios for query sentences with their corresponding ground-truth images. The retrieval results show that the output images contain largely the same objects.
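For a query set where ground truth pairs share an index (as in our datasets, where image i matches sentence i), R@K can be computed directly from the query-gallery similarity matrix. The sketch below is our own illustration of the metric, not the paper's evaluation code:

```python
import numpy as np

def recall_at_k(sim, k):
    """R@K: fraction of queries whose ground-truth item is in the top-k.

    sim[q, g] is the similarity between query q and gallery item g; the
    ground truth for query q is assumed to be gallery item q.
    """
    ranks = np.argsort(-sim, axis=1)[:, :k]   # top-k gallery indices per query
    hits = (ranks == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy check: 3 queries, ground truth on the diagonal.
sim = np.array([[0.9, 0.1, 0.0],    # correct item ranked 1st
                [0.8, 0.2, 0.1],    # correct item ranked 2nd
                [0.3, 0.5, 0.1]])   # correct item ranked 3rd
print(recall_at_k(sim, 1), recall_at_k(sim, 2), recall_at_k(sim, 3))
```

This directly explains the monotone increase from R@1 to R@10 reported above: enlarging K can only add more queries whose true match falls within the retrieved set.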
On the other hand, Figure 7 shows two unsuccessful scenarios, where the true matching image was not retrieved correctly. It is worth recalling that during training, we learn an embedding that aims to produce close representations for images and their corresponding descriptions. However, this task is very challenging, as the dataset is not very large (each image is associated with only five textual descriptions).

Sensitivity Analysis
We next investigate the results using models pre-trained on the two ImageNet datasets (i.e., ImageNet-21k and ImageNet-1k). Table 3 shows the results on the TextRS dataset using three different BiT architectures (m-R50x1, m-R50x3, and m-R101x1). As can be seen, the models pre-trained on ImageNet-21k yield better results than the models pre-trained on ImageNet-1k. In particular, the obtained R@1, R@5, and R@10 values for the text branch are 19.02%, 55.25%, and 71.72%, respectively, while the obtained values for the image branch are 21.86%, 60.00%, and 75.58%, respectively. These results suggest that the models pre-trained on ImageNet-21k are more suitable than those pre-trained on ImageNet-1k. In terms of computational complexity and accuracy, m-R50x1 seems a good choice compared to the other models. Table 3. Text-to-image and image-to-text retrieval results on the TextRS dataset.

Regarding the Merced dataset (Table 4), we observe that the largest model, m-R101x1, yields 23.68%, 60.38%, and 78.52%, respectively, for text-to-image retrieval, while for image-to-text retrieval the scores are 21.86%, 60.00%, and 75.58%, respectively. In comparison, using ImageNet-1k, the obtained R@1, R@5, and R@10 values are 19.71%, 56.04%, and 74.76%, respectively. Here again, the models pre-trained on ImageNet-21k exhibit better behavior, with m-R50x1 offering a good compromise between accuracy and computational complexity.

To assess the proposed model further, we carried out additional experiments in which the size of the fully connected layer was changed from 256 to 512. The results reported in Tables 5 and 6 suggest that setting the number of neurons to 256 yields, in general, better results.

Finally, we compare our results to the method based on triplet networks proposed recently in [31] (Table 7). As can be seen, the proposed method yields better retrieval results in terms of the R@1, R@5, and R@10 scores. For instance, our method yields scores of 19.02%, 55.25%, and 71.72% on the TextRS dataset, versus 14.18%, 44.18%, and 62.55% for the method based on triplet networks. Similarly, our method obtains scores of 22.76%, 58.47%, and 78.38% on the Merced dataset, while the triplet-network method gives 18.52%, 50.12%, and 69.20%.

Conclusions
In this work, we have proposed an unsupervised learning method for image retrieval in remote sensing imagery. Unlike traditional remote sensing image-to-image retrieval, this approach addresses the problem of text-to-image retrieval. The network consists of two asymmetric branches for image and sentence encoding, respectively. Experiments conducted on two different benchmark datasets show promising results compared to a recent method based on triplet networks. For future developments, we propose to investigate other embedding models, in addition to more robust losses, to increase the retrieval accuracy.