A Deep Hashing Technique for Remote Sensing Image-Sound Retrieval

Abstract: With the rapid progress of remote sensing (RS) observation technologies, cross-modal RS image-sound retrieval has attracted attention in recent years. However, existing methods perform cross-modal image-sound retrieval by leveraging high-dimensional real-valued features, which require more storage than low-dimensional binary features (i.e., hash codes). Moreover, these methods cannot directly encode relative semantic similarity relationships. To tackle these issues, we propose a new deep cross-modal RS image-sound hashing approach, called deep triplet-based hashing (DTBH), which integrates hash code learning and relative semantic similarity relationship learning into an end-to-end network. Specifically, the proposed DTBH method designs a triplet selection strategy to select effective triplets. Moreover, in order to encode relative semantic similarity relationships, we propose an objective function which ensures that anchor images are more similar to positive sounds than to negative sounds. In addition, a triplet regularized loss term leverages an approximate l1-norm between hash-like codes and hash codes, which effectively reduces the information loss between them. Extensive experimental results show that the DTBH method achieves superior performance to other state-of-the-art cross-modal image-sound retrieval methods. For the sound query RS image task, the proposed approach achieved a mean average precision (mAP) of up to 60.13% on the UCM dataset, 87.49% on the Sydney dataset, and 22.72% on the RSICD dataset. For the RS image query sound task, it achieved an mAP of 64.27% on the UCM dataset, 92.45% on the Sydney dataset, and 23.46% on the RSICD dataset. Future work will focus on how to exploit the balance property of hash codes to further improve image-sound retrieval performance.


Introduction
With the development of remote sensing (RS) observation technologies, the amount of RS data is increasing rapidly [1,2]. Nowadays, RS data retrieval has attracted wide attention in the RS research field [3,4]. It can retrieve useful information from large-scale RS data and has wide application prospects in disaster rescue scenarios [5,6]. Generally speaking, RS data retrieval can be roughly divided into uni-modal RS retrieval methods and cross-modal RS retrieval methods. Uni-modal RS retrieval methods [7][8][9][10][11][12][13] aim to search for RS data with a concept similar to the queried RS data, where all RS data come from the same modality. For example, Ye et al. [13] developed a flexible multiple-feature hashing learning framework, which maps multiple features of an RS image to low-dimensional binary features. Demir et al. [4] developed a hashing-based search approach to perform RS image retrieval in large RS data archives. Li et al. [5] presented a novel partial randomness strategy for hash code learning in large-scale RS image retrieval. Cross-modal RS retrieval methods [14] aim to search for RS data of one modality using queries from another modality.

Figure 1. The comparison of (a) existing image-sound retrieval methods [14,25] and (b) the proposed DTBH approach.
In fact, queried RS images (respectively, sounds) are better matched with relevant RS sounds (respectively, RS images) if more relative similarity relationships among RS images (respectively, sounds) are understood [26]. Clearly, the above issues can be tackled if we learn relative semantic similarity relationships and low-dimensional binary features simultaneously. Inspired by this idea, a deep cross-modal RS triplet-based hashing method was developed to perform relative semantic similarity relationship learning and hash code learning simultaneously for the image-sound retrieval application.
In this paper, we propose a new deep cross-modal RS image-sound hashing approach, called deep triplet-based hashing (DTBH), to integrate hash code learning and relative semantic similarity relationship learning into an end-to-end network, as shown in Figure 2. The whole framework contains the RS image branch, the positive RS sound branch, and the negative RS sound branch. To reduce storage costs, our proposed method exploits deep nonlinear mapping to project RS images and sounds into a common Hamming space. We then implement cross-modal image-sound retrieval with hash codes, which enables fast retrieval with low storage. To learn relative semantic similarity relationships, we utilize triplet labels to supervise hash code learning. Compared with pairwise labels, triplet labels can capture higher-level similarities across various situations, rather than only the similar/dissimilar distinction available with pairs. Furthermore, a triplet selection strategy was designed to capture the intra-class and inter-class variations, which is helpful for learning hash codes. In addition, we designed a new objective function, which consists of the triplet similarity loss function, the triplet regularized loss function, and the deep feature triplet loss function. The deep feature triplet loss function ensures that the anchor deep features are more similar to the positive deep features than to the negative deep features. The triplet regularized loss function makes hash-like codes more and more similar to hash codes by reducing the information loss. Extensive experimental results show that the DTBH method achieves superior performance to other cross-modal image-sound retrieval methods. The contributions can be summarized in the following four aspects:
1. A new deep cross-modal triplet-based hashing framework is proposed to leverage the triplet similarity of deep features and triplet labels to tackle the issue of insufficient utilization of relative semantic similarity relationships in RS image-sound similarity learning. To the best of our knowledge, this is the first work to use hash codes to perform cross-modal RS image-sound retrieval.
2. A new triplet selection strategy was developed to select effective triplets, which helps capture the intra-class and inter-class variations for hash code learning.
3. An objective function was designed to learn deep feature similarities and reduce the information loss between hash-like codes and hash codes.
4. Extensive experimental results on three RS image-sound datasets showed that the proposed DTBH method achieves superior performance to other state-of-the-art cross-modal image-sound retrieval methods.
The remainder of this paper is organized as follows. Section 2 presents the detailed procedure of the proposed DTBH method. Section 3 elaborates the experimental results. Section 4 presents the conclusions.

The Proposed Method
In this part, Section 2.1 clarifies the problem definition. The details of the multimodal architecture are presented in Section 2.2. Section 2.3 introduces the triplet selection strategy. Section 2.4 elucidates the objective function of the proposed DTBH method. The definitions of some symbols are shown in Table 1.

Multimodal Architecture
The proposed deep triplet-based hashing approach for RS cross-modal image-sound retrieval is shown in Figure 2. The whole approach consists of RS image modality and RS sound modality.
RS Image Modality: The configuration of the RS image modality is demonstrated in Figure 2. We leverage the convolution architecture of VGG16 [14] as the convolution architecture of the RS image branch. The following layers of the RS image modality consist of two fully connected layers. The first fully connected layer is the deep features layer, which consists of 1000 units and utilizes a sigmoid function as the activation function [28][29][30]. The second fully connected layer is the hash layer, which consists of K units and leverages the tanh function as the activation function [31]. The hash layer generates K-bit hash-like codes, from which K-bit hash codes are produced by the quantization function [32]. The hash codes h_m^a of the RS anchor image I_m^a are given by h_m^a = H^a(I_m^a) = sign(q_m^a) = sign(Γ(F^a(I_m^a), W^a)), where sign(·) denotes the element-wise sign function, i.e., sign(x) = 1 if x > 0 and sign(x) = −1 otherwise; h_m^a represents the K-bit hash codes; q_m^a represents the K-bit hash-like codes, which are the outputs of the hash layer; Γ represents the tanh function; F^a(I_m^a) represents the deep features, which are the outputs of the deep features layer; and W^a represents the weights of the second fully connected layer.
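The quantization step above can be sketched as follows. This is a minimal NumPy illustration of the element-wise sign rule (mapping zero to −1, per the definition sign(x) = −1 for x ≤ 0), not the paper's actual implementation.

```python
import numpy as np

def quantize(hash_like):
    """Quantize real-valued hash-like codes (tanh outputs in (-1, 1))
    into binary hash codes in {-1, +1} with the element-wise sign
    function; zeros map to -1, matching sign(x) = -1 for x <= 0."""
    return np.where(hash_like > 0, 1, -1)

q = np.array([0.73, -0.12, 0.0, 0.98])  # hypothetical hash-layer outputs
codes = quantize(q)                      # gives [1, -1, -1, 1]
```

At test time, only these binary codes need to be stored and compared, which is where the storage saving over real-valued features comes from.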
RS Sound Modality: Similar to [33,34], Mel-frequency cepstral coefficients (MFCC) are utilized to represent the RS sounds, because MFCC uses cepstrum feature extraction [35,36], which is more in line with the principles of human hearing; it is therefore among the most common and effective voice feature extraction algorithms [37,38]. We compute MFCC with a sixteen-millisecond window and a five-millisecond shift. Furthermore, the MFCC sequence is forced to a fixed length g: sequences longer than g are truncated, and sequences shorter than g are zero-padded. The configuration of the RS sound modality is demonstrated in Figure 2. The RS sound modality contains two identical subnetworks, whose convolution architectures share weight parameters. The first convolution layer of each subnetwork utilizes filters with a width of one frame across the whole frequency axis. The following layers of the subnetwork contain three 1D convolutions with max-pooling, using 32, 32, and 64 filters with widths of 11, 17, and 19, respectively. All max-pooling operations use a stride of two. The last two layers of the subnetwork consist of two fully connected layers, which do not share weight parameters. The first fully connected layer is the deep features layer, which consists of 1000 units and utilizes the sigmoid function as the activation function. The second fully connected layer is the hash layer, which consists of K units and leverages the tanh function as the activation function. The hash layer generates K-bit hash-like codes, from which K-bit hash codes are produced by the quantization function. For the RS positive sound, the hash codes h_m^p of the positive sound S_m^p are given by h_m^p = H^p(S_m^p) = sign(q_m^p) = sign(Γ(F^p(S_m^p), W^p)), where h_m^p represents the K-bit hash codes of the positive sound S_m^p, and q_m^p represents the K-bit hash-like codes, which are the output of the hash layer.
F^p(S_m^p) represents the deep features, which are the output of the deep features layer for the positive sound as input, and W^p represents the weights of the second fully connected layer for the positive sound as input. For the RS negative sound, the hash codes h_m^n of the negative sound S_m^n are given by h_m^n = H^n(S_m^n) = sign(q_m^n) = sign(Γ(F^n(S_m^n), W^n)), where h_m^n represents the K-bit hash codes of the negative sound S_m^n; q_m^n represents the K-bit hash-like codes, which are the output of the hash layer; F^n(S_m^n) represents the deep features, which are the output of the deep features layer for the negative sound as input; and W^n represents the weights of the second fully connected layer for the negative sound as input.
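The fixed-length step for the MFCC inputs can be sketched as below. This is a minimal NumPy example assuming an MFCC matrix of shape (frames, coefficients); the actual MFCC extraction (16 ms window, 5 ms shift) is left to a standard audio library.

```python
import numpy as np

def fix_length(mfcc, g):
    """Force an MFCC sequence (frames x coefficients) to exactly g
    frames: truncate sequences longer than g, zero-pad shorter ones.
    The paper fixes g = 2000."""
    frames, coeffs = mfcc.shape
    if frames >= g:
        return mfcc[:g]
    pad = np.zeros((g - frames, coeffs), dtype=mfcc.dtype)
    return np.vstack([mfcc, pad])
```

Fixing the length this way lets every sound in a batch share the same input shape for the 1D convolutional subnetwork.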

Triplet Selection Strategy
Previous cross-modal RS retrieval approaches [14,39] did not consider the construction of samples, and thus cannot achieve superior cross-modal retrieval performance. To improve cross-modal retrieval performance, we designed a novel triplet selection strategy that randomly selects one hard negative sound for each positive image-sound pair from a negative sound set. The triplet selection strategy can be formulated as (I_m, S_m^p, τ(S_m^n)), where I = {I_m}_{m=1}^M represents the RS anchor images and S^p = {S_m^p}_{m=1}^M represents the RS positive sounds. τ(S_m^n) represents the random function, which randomly chooses one negative sound from the hard negative sound set S^n = {S_m^n : ||f^a(I_m) − f^n(S_m^n)||_2^2 < ||f^a(I_m) − f^p(S_m^p)||_2^2 + ε}, where || · ||_2 and ε represent the l2-norm and the margin parameter, respectively. f^a(I_m) represents the deep features of the image I_m; f^p(S_m^p) represents the deep features of the positive sound S_m^p; and f^n(S_m^n) represents the deep features of the negative sound S_m^n. The triplet selection strategy helps grasp the relative relationships between samples and contributes to learning effective hash codes.
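The selection step can be sketched as follows, under the assumption that hard negatives are those whose distance to the anchor is within the margin ε of the positive distance; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def select_triplet(f_anchor, f_pos, f_negs, eps, rng=None):
    """Select one hard negative index for an (anchor image, positive
    sound) pair: the hard-negative set contains sounds whose squared
    l2 distance to the anchor is below the positive distance plus the
    margin eps; one member is chosen uniformly at random, falling back
    to the closest negative if the set is empty."""
    if rng is None:
        rng = np.random.default_rng()
    d_pos = np.sum((f_anchor - f_pos) ** 2)
    d_negs = np.sum((f_negs - f_anchor) ** 2, axis=1)
    hard = np.flatnonzero(d_negs < d_pos + eps)
    if hard.size == 0:
        return int(np.argmin(d_negs))
    return int(rng.choice(hard))
```

Sampling from the margin-violating set concentrates training on the negatives that actually contribute a nonzero triplet loss.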

Objective Function
Compared with cross-modal retrieval approaches using a pairwise loss [14,39], the proposed DTBH method leverages a triplet loss to learn the relative similarity relationship between RS images and sounds, because the relative similarity relationship established by the triplet loss is more reasonable than the absolute similarity relationship exploited by the pairwise loss. The pairwise loss captures the intra-class and inter-class variations separately, whereas the triplet loss captures them simultaneously. The goal of the proposed DTBH method is to learn a hash function that projects samples into hash codes while maintaining the similarity of matched RS images and sounds. For this goal, the anchor image I_m^a and the positive sound S_m^p should be as close as possible, while the anchor image I_m^a and the negative sound S_m^n should be as far apart as possible. Inspired by [40], the triplet similarity loss function can be defined as T = Σ_m max(0, H_d(h_m^a, h_m^p) − H_d(h_m^a, h_m^n) + ε), where T represents the triplet similarity loss, which ensures that the anchor images are more similar to the positive sounds than to the negative sounds; H_d(·, ·) represents the Hamming distance; h_m^a represents the hash code of the anchor image I_m^a; h_m^n represents the hash code of the negative sound S_m^n; h_m^p represents the hash code of the positive sound S_m^p; ε represents the margin parameter; and max(·) represents the maximum function.
When directly optimizing Equation (5), it is difficult to calculate derivatives in the network training process. To solve this problem, a relaxation strategy is adopted to replace the Hamming distance of discrete hash codes with the l2-norm of hash-like codes [41]. The triplet similarity loss function is then redefined as T = Σ_m max(0, ||q_m^a − q_m^p||_2^2 − ||q_m^a − q_m^n||_2^2 + ε), where ε represents the margin parameter and || · ||_2 represents the l2-norm. q_m^a represents the hash-like code of the anchor image I_m^a, defined as q_m^a = Γ(F^a(I_m^a), W^a), where Γ represents the tanh function, F^a(I_m^a) represents the deep features output by the deep features layer, and W^a represents the weights of the second fully connected layer for the anchor image I_m^a. Similarly, q_m^p = Γ(F^p(S_m^p), W^p) represents the hash-like code of the positive sound S_m^p, and q_m^n = Γ(F^n(S_m^n), W^n) represents the hash-like code of the negative sound S_m^n, where F^n(S_m^n) represents the deep features of the negative sound S_m^n and W^n represents the weights of the second fully connected layer for the negative sound. Nevertheless, this relaxation strategy leads to information loss between hash-like codes and hash codes, so it is necessary to design a regularized term between them to reduce the information loss. Motivated by iterative quantization (ITQ) [42], a new triplet regularized loss was developed, given as R = Σ_m (||h_m^a − q_m^a||_2 + ||h_m^p − q_m^p||_2 + ||h_m^n − q_m^n||_2), where R represents the triplet regularized loss, which makes hash-like codes more and more similar to hash codes by reducing the information loss, and || · ||_2 represents the l2-norm.
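The relaxed loss can be sketched as a batched NumPy function. This is an illustrative reading of Equation (6), with the summation over triplets replaced by a batch mean, and is not the authors' code.

```python
import numpy as np

def triplet_similarity_loss(q_a, q_p, q_n, eps):
    """Relaxed triplet similarity loss: the Hamming distance on
    discrete hash codes is replaced by the squared l2 distance on
    real-valued hash-like codes so gradients can flow. Inputs are
    (batch, K) arrays of tanh outputs."""
    d_pos = np.sum((q_a - q_p) ** 2, axis=1)
    d_neg = np.sum((q_a - q_n) ** 2, axis=1)
    return np.maximum(0.0, d_pos - d_neg + eps).mean()
```

The hinge goes to zero once every negative is at least ε farther from the anchor than its positive, which is exactly the ordering the Hamming-distance version enforces.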
Equation (7) above utilizes the l2-norm to reduce the information loss. Compared with the l2-norm, the l1-norm requires less computation and encourages sparsity in hash code learning [43,44]. Equation (7) can therefore be reformulated as R = Σ_m (||h_m^a − q_m^a||_1 + ||h_m^p − q_m^p||_1 + ||h_m^n − q_m^n||_1), where || · ||_1 represents the l1-norm. Furthermore, Theorem 1 reveals that the l1-norm between hash-like codes and hash codes is an upper bound of the l2-norm between them, so minimizing the l1-norm also minimizes an upper bound of the l2-norm. The detailed proof of Theorem 1 is presented below.
For any vector x, ||x||_2 ≤ ||x||_1. Hence, the relationship between h_m^a and q_m^a can be given as ||h_m^a − q_m^a||_2 ≤ ||h_m^a − q_m^a||_1; the relationship between h_m^p and q_m^p can be given as ||h_m^p − q_m^p||_2 ≤ ||h_m^p − q_m^p||_1; and the relationship between h_m^n and q_m^n can be given as ||h_m^n − q_m^n||_2 ≤ ||h_m^n − q_m^n||_1. Summing these inequalities, we can derive that the l2-norm version of the triplet regularized loss is bounded above by its l1-norm version, which completes the proof. However, R makes it difficult to calculate derivatives for the network architecture of the DTBH approach, because the absolute value function is not differentiable at zero. Inspired by [45], the smooth surrogate of the absolute value function, |x| ≈ log cosh x, is exploited in the network architecture of the DTBH approach. Equation (8) can then be written as R = Σ_m Σ_{k=1}^K [log cosh(h_{m(k)}^a − q_{m(k)}^a) + log cosh(h_{m(k)}^p − q_{m(k)}^p) + log cosh(h_{m(k)}^n − q_{m(k)}^n)], where h_{m(k)}^a denotes the k-th position of the hash codes h_m^a for RS image I_m^a, q_{m(k)}^a denotes the k-th position of the hash-like codes q_m^a for RS image I_m^a, K represents the length of the hash codes, and | · | represents the absolute value operation. To further strengthen the relationship between the hash codes of RS images and the hash codes of RS sounds, the similarity of deep features is also learned in the network architecture of the DTBH approach, because deep feature similarity promotes the learning of similarity relations between RS images and RS sounds. The deep feature triplet loss can be given as D = Σ_m max(0, ||F^a(I_m^a) − F^p(S_m^p)||_2^2 − ||F^a(I_m^a) − F^n(S_m^n)||_2^2 + ε), where D denotes the deep feature triplet loss, which preserves the similarity of deep features; || · ||_2 represents the l2-norm; and ε represents the margin parameter. Combining Equations (6), (16), and (17), the overall objective function of the DTBH approach can be defined as L = T + αR + βD, where α and β denote the trade-off parameters and L denotes the overall objective function, which consists of the triplet similarity loss T, the triplet regularized loss R, and the deep feature triplet loss D. The objective function was optimized with Adam [46]. The detailed algorithmic procedure of DTBH is shown in Algorithm 1. The triplet similarity loss T ensures that the RS anchor images are more similar to the RS positive sounds than to the RS negative sounds; the triplet regularized loss R makes hash-like codes more and more similar to hash codes by reducing the information loss; and the deep feature triplet loss D preserves the similarity of deep features.
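The smooth surrogate and the combination of the three loss terms can be sketched as follows. Note that attaching α to R and β to D in the overall objective is a reconstruction from context (the paper fixes α = 1 and β = 0.01), so the exact weighting is an assumption.

```python
import numpy as np

def log_cosh_reg(h, q):
    """Smooth surrogate of the l1 quantization loss: each |h_k - q_k|
    is replaced by log(cosh(h_k - q_k)), which is differentiable
    everywhere, including at zero."""
    return float(np.sum(np.log(np.cosh(h - q))))

def overall_objective(T, R, D, alpha=1.0, beta=0.01):
    """Combine the triplet similarity loss T, the triplet regularized
    loss R, and the deep feature triplet loss D with trade-off
    parameters alpha and beta (assumed weighting; the paper sets
    alpha = 1 and beta = 0.01)."""
    return T + alpha * R + beta * D
```

When hash-like codes exactly equal their quantized hash codes, the surrogate regularizer vanishes, mirroring the behavior of the l1 term it approximates.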

Algorithm 1 Optimization algorithm for learning DTBH.
Input: the RS image-sound triplet units.
Output: the parameters W of the DTBH approach.
Initialization: utilize the glorot_uniform distribution to initialize W.

Experiments
In this section, Section 3.1 describes the three RS image-sound datasets and the evaluation protocols. Section 3.2 introduces the detailed implementation of the proposed DTBH method. Section 3.3 presents the evaluation of different factors for the proposed DTBH method. Section 3.4 describes the experimental results. Section 3.5 discusses the parameter analysis of the proposed DTBH method.

Dataset and Evaluation Protocols
To prove the validity of the proposed DTBH method, three RS image-voice datasets were used to compare the DTBH method with other cross-modal image-voice methods. (1) The UCM dataset [47] contains 2100 RS image-sound pairs: 2100 RS images of 21 classes, each with one corresponding sound. We leveraged the triplet selection strategy to construct 6300 triplet units. (2) The Sydney dataset [47] consists of 613 RS image-sound pairs: 613 RS images of seven classes, each with one corresponding sound. The triplet selection strategy was leveraged to construct 1839 triplet units. (3) The RSICD dataset [48] consists of 10,921 RS image-sound pairs: 10,921 RS images of 30 classes, each with one corresponding sound. We exploited the triplet selection strategy to construct 32,763 triplet units. Following [14], we randomly selected 80% of the RS image-sound triplets as training data and the remaining 20% as testing data for all three datasets. In the testing process, we use testing RS images (resp. sounds) as the query data and testing RS sounds (resp. RS images) as the gallery data. Some example images and sounds from the three RS image-sound datasets are shown in Figure 3. Moreover, to evaluate the validity of the proposed DTBH method, it was compared with SIFT+M, DBLP [39], convolutional neural network and spectrogram (CNN+SPEC) [22], and the deep visual-audio network (DVAN) [14]. Note that the DTBH method uses 64-bit hash codes; the SIFT+M method projects SIFT features of images and MFCC features of voices into a common feature space by exploiting deep neural networks. DBLP [39], CNN+SPEC [22], and DVAN [14] were implemented in this study. Following [14], similar images and sounds are considered ground-truth neighbors.
The evaluation metrics, mean average precision (mAP) and precision within the top m of the ranking list (precision@m), were exploited to assess the experimental results [49][50][51][52]. Precision represents the proportion of correct samples among all samples in the ranking list [53]. Higher values of these metrics indicate better retrieval results [54].
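These metrics can be computed as sketched below, assuming the gallery is ranked by Hamming distance on ±1 hash codes; this is a standard formulation of average precision and precision@m, not code from the paper.

```python
import numpy as np

def hamming_rank(query_code, gallery_codes):
    """Rank gallery hash codes ({-1,+1}, shape (N, K)) by Hamming
    distance to the query; with +-1 codes, Hamming distance equals
    (K - dot product) / 2."""
    K = query_code.size
    d = (K - gallery_codes @ query_code) / 2
    return np.argsort(d, kind="stable")

def average_precision(ranked_relevance):
    """AP over a ranked 0/1 relevance list. mAP averages AP over all
    queries; precision@m is simply the mean of the first m entries."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    cum = np.cumsum(rel)
    prec = cum / (np.arange(rel.size) + 1)
    return float((prec * rel).sum() / rel.sum())

ap = average_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2, about 0.833
```

Because the dot product of ±1 codes determines the Hamming distance, ranking reduces to a single matrix multiplication, which is the practical speed advantage of hash-based retrieval.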

Implementation Details
The proposed DTBH method was implemented with the open-source KERAS library. The experiments were run on a workstation with a GeForce GTX Titan X GPU, an Intel Core i7-5930K 3.50 GHz CPU, and 64 GB RAM. For MFCC, the parameter g was fixed at 2000. The overall objective function was optimized by Adam [46] with a learning rate of 10^-3. The initial weights of the DTBH approach were drawn from the glorot_uniform distribution. The batch size was fixed at 64. The parameter α was fixed at 1 and the parameter β at 0.01. To produce {16, 24, 36, 48, 64}-bit binary codes, K was set to 16, 24, 36, 48, and 64, respectively. The proposed DTBH approach was trained for 5000 epochs, or until the loss no longer diminished [55].

Evaluation of Different Factors
To evaluate the effectiveness of the proposed DTBH method, we analyzed three important factors: deep feature similarity, the triplet selection strategy, and the triplet regularized term.
The experiments were implemented in four ways. Firstly, we used the proposed DTBH method without the triplet selection strategy (DTBH-S). Secondly, we used the proposed DTBH method without the triplet regularized term (DTBH-R). Thirdly, we used the proposed DTBH method without deep feature similarity (DTBH-D). Finally, we used the proposed DTBH method without the deep feature similarity, triplet selection strategy, and triplet regularized term (DTBH-T). Table 2 shows the contrasting results of DTBH-S, DTBH-R, DTBH-D, DTBH-T, and DTBH on the UCM dataset with different hash code lengths. Figure 4 shows the comparative results of DTBH-T, DTBH-D, DTBH-R, DTBH-S, and DTBH for different hash bits on the UCM dataset using RS images to retrieve sounds, and Figure 5 shows the corresponding results using sounds to retrieve RS images. Here, "S→I" represents the case where the queries are RS sounds and the gallery is RS images, and "I→S" the case where the queries are RS images and the gallery is RS sounds. It is clearly seen from Figures 4 and 5 and Table 2 that the proposed DTBH method achieves superior mAP to DTBH-T, DTBH-D, DTBH-R, and DTBH-S for all hash code lengths. For S→I, the proposed DTBH method improved the 32-bit mAP from DTBH-T (43.25%), DTBH-D (46.65%), DTBH-R (54.61%), and DTBH-S (55.28%) to 58.36%. For I→S, the proposed DTBH method improved the 32-bit mAP from DTBH-T (48.09%), DTBH-D (52.38%), DTBH-R (60.93%), and DTBH-S (61.38%) to 63.45%. This is because the proposed DTBH method utilizes deep feature similarity, the triplet selection strategy, and the triplet regularized term together to achieve superior retrieval performance.
To assess the impacts of different convolution architectures on the proposed DTBH method, we evaluated several variants of DTBH: DTBH+AlexNet, DTBH+GoogleNet, and DTBH+VGG16. These variants differ in the head of the image network: DTBH+VGG16 uses the convolution part of the VGG-16 network, DTBH+AlexNet the convolution part of AlexNet, and DTBH+GoogleNet the convolution part of GoogleNet. Table 3 shows the contrasting results of DTBH+AlexNet, DTBH+GoogleNet, and DTBH+VGG16 on the UCM dataset in terms of mean average precision (mAP) for different hash bits. It can be observed from Table 3 that the cross-modal hashing algorithm using the VGG-16 network achieves better performance than the identical algorithm using the GoogleNet or AlexNet network.

Results
(1) Results on UCM: Table 4 shows the performance comparison between the proposed DTBH method and the compared methods on the UCM dataset using sounds to retrieve RS images. Table 5 shows the corresponding comparison using RS images to retrieve sounds. Figure 6 shows precision curves with different numbers of samples retrieved using sounds to retrieve RS images on the UCM dataset, and Figure 7 the corresponding curves using RS images to retrieve sounds. We can clearly see that: (1) Although the compared methods yield good results, the proposed DTBH method achieves the highest mean average precision and the highest precision for the top one, top five, and top 10 samples retrieved. Figure 8 shows the top eight retrieval results of the proposed DTBH approach on the UCM dataset using RS images to retrieve voices, and Figure 9 the top eight retrieval results using voices to retrieve RS images. (2) It can be clearly seen from Figures 6 and 7 that, for S→I, the proposed DTBH method improved the mAP over all compared methods, including DTBH-R (56.91%), to 60.13%. This is because the proposed DTBH method not only leverages the triplet selection strategy to mine effective triplets, but also exploits deep feature similarity and the triplet regularized term to learn the similarity of hash codes.
(2) Results on Sydney: Table 6 shows the performance comparison between the proposed DTBH method and other methods on the Sydney dataset using sounds to retrieve RS images, and Table 7 the corresponding comparison using RS images to retrieve sounds. Precision curves with different numbers of samples retrieved using sounds to retrieve RS images and using RS images to retrieve sounds are shown in Figures 10 and 11, respectively. Results similar to those on the UCM dataset can be clearly seen. For example, for I→S, the proposed DTBH method improved the mAP from SIFT+M (31.67%), DBLP (44.38%), CNN+SPEC (46.67%), DVAN (71.77%), DTBH-D (81.23%), and DTBH-R (89.64%) to 92.45%. Furthermore, for S→I, the proposed DTBH method improved the mAP from SIFT+M (26.5%), DBLP (34.87%), CNN+SPEC (35.72%), DVAN (63.88%), DTBH-D (76.53%), and DTBH-R (85.46%) to 87.49%. The proposed DTBH method achieves the highest precision in all the evaluation metrics, which demonstrates the effectiveness of cross-modal similarity learning that utilizes deep feature similarity, the triplet selection strategy, and the triplet regularized term simultaneously.
Figure 11. The precision curves with different samples retrieved by using RS images to retrieve sounds from the Sydney dataset (I→S).
(3) Results on the RSICD dataset: The RSICD image-voice dataset consists of 10,921 RS image-sound pairs, making it more complex and challenging than the other two datasets. Table 8 shows the performance comparison between the proposed DTBH method and other methods on the RSICD dataset using sounds to retrieve RS images, and Table 9 the corresponding comparison using RS images to retrieve sounds. Figure 12 shows precision curves with different numbers of samples retrieved using sounds to retrieve RS images, and Figure 13 the corresponding curves using RS images to retrieve sounds on the RSICD dataset. The proposed DTBH method achieved the highest Precision@1, Precision@5, Precision@10, and mAP results, which further demonstrates its effectiveness.

Parameter Discussion
To discuss the parameters of the proposed DTBH method, we performed experiments on the two parameters α and β of Equation (18) on the UCM dataset. First, we set the parameter α to 1 and varied the parameter β from 0 to 10. Figure 14 shows the mAP variations with the parameter β for different hash bits on the UCM dataset using voices to retrieve RS images. It can be seen from Figure 14 that the proposed DTBH approach achieves the best mAP when β = 0.01. Second, we fixed the parameter β at 0.01 and varied the parameter α from 0 to 10. Figure 15 shows the mAP variations with the parameter α for different hash bits on the UCM dataset using sounds to retrieve RS images. It is observed from Figure 15 that the proposed DTBH approach achieves the best mAP when α = 1. Hence, the parameters α and β were set to 1 and 0.01, respectively.

Conclusions
In this paper, we proposed a novel deep triplet-based hashing (DTBH) approach, which leverages deep feature similarity and a triplet selection strategy to guide hash code learning in RS cross-modal image-sound retrieval. Specifically, compared with high-dimensional real-valued features, hash codes can reduce storage costs. Firstly, we proposed a new triplet selection strategy, which selects effective triplets to capture the intra-class and inter-class variations for hash code learning. Secondly, we proposed a novel objective function, which consists of the triplet similarity loss function, the triplet regularized loss function, and the deep feature triplet loss function. The triplet similarity loss function ensures that the anchor images are more similar to the positive sounds than to the negative sounds. The deep feature triplet loss function ensures that the anchor deep features are more similar to the positive deep features than to the negative deep features. The triplet regularized loss function reduces the information loss between hash-like codes and hash codes. Finally, for the sound query RS image task, the proposed approach achieved a mean average precision of up to 60.13% on the UCM dataset, 87.49% on the Sydney dataset, and 22.72% on the RSICD dataset. For the RS image query sound task, it achieved a mean average precision of up to 64.27% on the UCM dataset, 92.45% on the Sydney dataset, and 23.46% on the RSICD dataset. Moreover, extensive experimental results on the UCM, Sydney, and RSICD datasets show that the DTBH method achieves better performance than other state-of-the-art cross-modal image-sound retrieval methods. Future work can be divided into two main aspects. First, we plan to apply DTBH to other applications, such as cross-modal biometric matching, to demonstrate its broad effectiveness. Second, we will focus on how to incorporate the balance property of hash codes to improve image-sound retrieval performance.
Author Contributions: Y.C. and X.L. made contributions to proposing the method, doing the experiments, and analyzing the results. Y.C. and X.L. were involved in the preparation and revision of the manuscript. All authors have read and agreed to the published version of the manuscript.