TDCMR: Triplet-Based Deep Cross-Modal Retrieval for Geo-Multimedia Data

Massive amounts of multimedia data with geographical information (geo-multimedia) are collected and stored on the Internet due to the wide application of location-based services (LBS). Finding the high-level semantic relationships among geo-multimedia data and constructing an efficient index are crucial for large-scale geo-multimedia retrieval. To address this challenge, this paper proposes a deep cross-modal hashing framework for geo-multimedia retrieval, termed Triplet-based Deep Cross-Modal Retrieval (TDCMR), which uses a deep neural network and an enhanced triplet constraint to capture high-level semantics. In addition, a novel hybrid index, called TH-Quadtree, is developed by combining cross-modal binary hash codes with a quadtree to support high-performance search. Extensive experiments are conducted on three commonly used benchmarks, and the results show the superior performance of the proposed method.


Introduction
With the rapid development of the mobile internet, social networks, and location-based services (LBS), large amounts of multimedia data [1] with geographical information (a.k.a. geo-multimedia) [2], such as text, images [3,4], and video [5][6][7][8], are collected and stored on the internet. As an important data resource, geo-multimedia data is used to support location-based recommendation, targeted advertising, and data search. The nearest neighbor spatial keyword query (NNSKQ) is an important retrieval technique in LBS applications, but it uses only location and keyword information to find spatial objects. It is therefore limited to structured data or the text modality [9][10][11] and cannot be directly applied to geo-multimedia data [12][13][14]. Traditional multi-modal retrieval techniques, on the other hand, ignore geographic location information. To resolve this dilemma, researchers have tried to integrate multi-modal information into the query and proposed an effective nearest neighbor query method for geo-multimedia data [15].
In addition, two groups of tasks, i.e., cross-modal retrieval and spatial textual query, are closely related to geo-multimedia retrieval. Cross-modal retrieval [16] is a hot topic in the multimedia community; it aims to search multimedia instances using queries of a different modality [17][18][19][20]. Its main challenge is to narrow the semantic gap between modalities, which is the main obstacle to measuring cross-modal semantic similarity. Recently, many deep learning-based methods have been proposed [13,14][21][22][23][24], which outperform traditional approaches based on hand-crafted features. Moreover, to scale to large multimedia databases, many researchers focus on cross-modal hashing to reduce search and storage costs [25][26][27][28]. Deep neural networks generate compact binary hash codes from multimedia instances, and these codes carry semantic information. The other task, spatial textual query, performs a keyword-based query with the aid of geographical information to reduce the candidate set substantially. Several important studies [29][30][31][32][33] proposed in the last decade support efficient LBS. Recently, some studies [2,34,35] extended spatial indexes to multimedia data, covering geo-image top-k query, k nearest neighbor query, spatial visual similarity join, and geo-image reverse query. These works make full use of geo-multimedia data through hybrid indexes that organize geo-multimedia instances by both content and geo-location.
Motivation. Although great progress has been made, two challenges remain in cross-modal retrieval for geo-multimedia data. The first challenge is that insufficient semantic similarity learning on geo-multimedia data leads to inaccurate retrieval. Two problems of cross-modal hashing [36][37][38] cannot be neglected. The first is how to extract multi-modal features effectively so as to capture high-level semantics [39]. Traditional methods use hand-crafted features for hash function learning, which cannot represent high-level semantics well, whereas deep neural networks can extract them. The second is how to establish semantic relationships effectively so as to narrow the semantic gap. Existing cross-modal hashing methods use pairwise similarity constraints (pairwise labels) as supervision, so that the distance between similar image-text pairs is less than that between dissimilar image-text pairs [40]. As a result, the relative semantic relationship among cross-modal data is lost, which limits the semantic representation capability of hash learning. The second challenge is the inefficiency of index and retrieval algorithms on massive geo-multimedia databases. To address it, Reference [41] develops a hybrid index called GMR-Tree, an extension of the R-Tree that integrates cross-modal representations. However, that work ignores cross-modal hashing representations, which can enhance search efficiency significantly, and the GMR-Tree-based search algorithm cannot be directly used for cross-modal hashing retrieval. Therefore, to address these two challenges, we improve the semantic similarity constraints that guide the deep neural networks and design a hybrid geo-multimedia index to organize cross-modal hash codes. The following example illustrates the problem.
Example 1: As illustrated in Figure 1, a user wants to buy a blue shoe from a nearby shop. Unfortunately, he does not know the brand of the shoe, and it is hard for him to describe its features in words. With such limited information, it is difficult to find suitable shoes in nearby shops. In this case, he can submit an image of the blue shoe together with his location from his mobile phone as a spatial multimedia k nearest neighbor query to a geo-tagged multimedia data retrieval system. The system returns a result set containing k geo-multimedia data objects that meet his requirements. This result set tells the user which shops have this kind of shoe and are close to his location.
Our Method. To this end, this paper proposes a novel and efficient cross-modal hashing approach, termed Triplet-based Deep Cross-Modal Retrieval (TDCMR). Specifically, a two-branch deep neural network backbone is integrated into the TDCMR framework to learn abstract semantic concepts. In addition, an improved triplet distance constraint is designed to capture multiple high-level similarities and thus the semantic relationships among heterogeneous multi-modal data. Integrating deep representation learning with the enhanced triplet distance constraint improves cross-modal semantic learning performance. Furthermore, to realize efficient search on large-scale geo-multimedia data, a novel index, termed TDCMR-Quadtree, is proposed. It is a geo-semantic hybrid index that integrates a quadtree with cross-modal hash codes and uses both geographic information and semantic similarity to prune the candidate set. Based on this index, an efficient cross-modal nearest neighbor query algorithm is developed for geo-multimedia retrieval.

Figure 1. An example of a spatial multimedia k nearest neighbor query on a geo-tagged multimedia data retrieval system.

Contributions.
The main contributions of our work are summarized as follows:
• We propose a triplet-based deep cross-modal hashing framework, named Triplet-based Deep Cross-Modal Retrieval (TDCMR), which extracts deep sample features to alleviate the semantic gap through a triplet deep neural network that unifies feature learning and hash learning.
• We combine the TDCMR hash codes with a quadtree to build a novel hybrid index, named TDCMR-Quadtree, which significantly improves retrieval efficiency.
• We conduct extensive experiments on three commonly used benchmarks, and the results demonstrate that the proposed method achieves high retrieval performance.

Roadmap.
In the remainder of this paper, we review previous research in Section 2. In Section 3, we introduce the definition of the geo-multimedia data k nearest neighbor query and the related notions. In Section 4, we introduce the Triplet-based Deep Cross-Modal Retrieval (TDCMR) framework and its implementation. In Section 5, we evaluate our method on two geo-multimedia datasets. Finally, we conclude this paper in Section 6.

Cross-Modal Hashing
Cross-modal hashing aims to map high-dimensional data of different modalities into a common hash code space, in which heterogeneous data can be semantically represented and compared. With the rapid development of deep learning [21,22], deep learning-based cross-modal hashing [25][26][27][28] has made significant progress recently. It can effectively capture high-level information and explore semantic relevance to bridge the modality gap [42]. To preserve cross-modal similarities, Deep Cross-Modal Hashing (DCMH) [43] uses a negative log-likelihood loss function and performs feature learning and hash-code learning in an end-to-end framework. Pairwise Relationship Guided Deep Hashing (PRDH) [40] integrates different types of pairwise constraints to guide hash code learning from the intra-modality and inter-modality perspectives, respectively, and introduces additional decorrelation constraints to enhance the discriminative ability of each hash bit. Self-Supervised Adversarial Hashing (SSAH) [44] improves retrieval accuracy by jointly using two adversarial networks, but its high training cost limits its practical value.

Spatial Related Data Retrieval
With the proliferation of local services and GPS-enabled mobile phones, the amount of spatio-textual data is growing rapidly, and so is the need for spatial data retrieval. The Spatial Keyword k Nearest Neighbor Query (sKkNN) has therefore become an important type of query. Zhang et al. [29] designed a new index named IL-Quadtree and proposed an efficient algorithm to improve query performance. Cong et al. [30] proposed a new indexing framework for the top-k aggregation problem, which adopts an inverted file to search text data and employs an R-tree to index closed objects. To further improve retrieval performance, Rocha-Junior et al. [32] proposed the spatial inverted index.

Quadtree
The quadtree is a tree data structure proposed by Raphael Finkel et al. in 1974. Spatial indices store and manage spatial data according to the geographic location, shape, and spatial relationships of spatial objects, and they contain the identifier, pointer, and other descriptive information of these objects [29]. Compared with conventional index types, spatial indices can efficiently handle spatial queries [45], such as how far apart two points are or whether points fall within a spatial area of interest. We use the quadtree to organize spatial data in this paper because of its simple structure and high retrieval efficiency. The quadtree extends the binary tree to two-dimensional space and has many variants, such as the region quadtree, point quadtree, PR quadtree, and MX quadtree. Its basic idea is to recursively subdivide a two-dimensional space into a hierarchy of regions. Specifically, it partitions the geographic space into four quadrants along two orthogonal directions and recursively subdivides the subspaces until the tree reaches the maximum depth or the number of objects in a leaf node is less than or equal to a predetermined threshold.
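The recursive subdivision described above can be sketched as follows. This is a minimal point-quadtree illustration, not the index used in this paper; the names `QuadNode`, `insert`, `child_for`, `leaves`, and the values of `MAX_POINTS` and `MAX_DEPTH` are hypothetical choices made only for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]
MAX_POINTS = 2  # hypothetical leaf capacity (the "predetermined threshold")
MAX_DEPTH = 8   # hypothetical maximum depth


@dataclass
class QuadNode:
    x: float      # lower-left corner of the square region
    y: float
    size: float   # side length of the region
    depth: int = 0
    points: List[Point] = field(default_factory=list)
    children: Optional[List["QuadNode"]] = None


def child_for(node: QuadNode, p: Point) -> QuadNode:
    """Pick the quadrant of an internal node that contains point p."""
    half = node.size / 2
    col = 1 if p[0] >= node.x + half else 0
    row = 1 if p[1] >= node.y + half else 0
    return node.children[row * 2 + col]


def insert(node: QuadNode, p: Point) -> None:
    """Insert p, splitting a leaf into four quadrants when it overflows."""
    if node.children is not None:
        insert(child_for(node, p), p)
        return
    node.points.append(p)
    if len(node.points) > MAX_POINTS and node.depth < MAX_DEPTH:
        half = node.size / 2
        node.children = [
            QuadNode(node.x + dx * half, node.y + dy * half, half, node.depth + 1)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pending, node.points = node.points, []
        for q in pending:  # push the stored points down into the quadrants
            insert(child_for(node, q), q)


def leaves(node: QuadNode) -> List[QuadNode]:
    """Collect all leaf nodes below (and including) node."""
    if node.children is None:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]


root = QuadNode(0.0, 0.0, 100.0)
for pt in [(10, 10), (20, 20), (80, 80), (90, 10)]:
    insert(root, pt)
```

In this toy run the third insertion overflows the root, which is then split once into four quadrants; a hybrid index would additionally attach hash-code information to the nodes, which is not shown here.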

Preliminaries
In this section, we give the formal definitions of geo-multimedia data similarity and the k nearest neighbor query. Then, we review the Siamese Network and the Triple Network, which are the basis of our work. Table 1 summarizes the frequently used notations in this paper.

Problem Definition
The cross-modal hashing algorithm solves the problem of uniform representation and mutual retrieval of multi-modal data through hash learning. Assume a training set of $n$ items, where the $i$-th item consists of text data $T_i$, image data $I_i$, and category label $t_i$. The triplet label subscripts form the set $\Gamma = \{(a_i, p_i, n_i)\}_{i=1}^{m}$, where the $j$-th triplet $(a_j, p_j, n_j)$ indicates that the similarity between anchor sample $a_j$ and positive sample $p_j$ is higher than that between anchor sample $a_j$ and negative sample $n_j$. That is, samples $a_j$ and $p_j$ have common category labels, while samples $a_j$ and $n_j$ do not.
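As an illustration of how such triplet labels could be derived from category labels, the sketch below enumerates all index triplets for a tiny multi-label dataset. The function name `build_triplets` and the exhaustive enumeration are assumptions made for the example; in practice triplets would typically be sampled rather than enumerated.

```python
from itertools import product


def build_triplets(labels):
    """labels[i] is the set of category labels of the i-th training item.

    Returns index triplets (a, p, n) where a and p share at least one label
    and a and n share none.
    """
    triplets = []
    n_items = len(labels)
    for a in range(n_items):
        positives = [p for p in range(n_items) if p != a and labels[a] & labels[p]]
        negatives = [m for m in range(n_items) if m != a and not (labels[a] & labels[m])]
        triplets.extend((a, p, m) for p, m in product(positives, negatives))
    return triplets


labels = [{"sky", "sea"}, {"sea"}, {"dog"}, {"dog", "cat"}]
triplets = build_triplets(labels)
```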
Definition 1 (Spatial similarity). Given a geo-multimedia query $q = (q.loc, q.m)$ and a spatial object $o$, the spatial similarity between $q$ and $o$ is defined from the ratio of the Euclidean distance $\delta(q.loc, o.loc)$ between $q$ and $o$ to the maximum Euclidean distance $\delta_{max}(q, O)$, which can be expressed as:

$$f_s(q, o) = 1 - \frac{\delta(q.loc, o.loc)}{\delta_{max}(q, O)}$$

where $\delta(q.loc, o.loc)$ represents the Euclidean distance between query $q$ and spatial object $o$, and $\delta_{max}(q, O)$ represents the maximum Euclidean distance between query $q$ and any spatial object in the spatial object dataset $O$, which can be expressed as:

$$\delta_{max}(q, O) = \max_{o' \in O} \delta(q.loc, o'.loc)$$

Definition 2 (Cross-modal semantic similarity). Given a geo-multimedia query $q = (q.loc, q.m)$ and a spatial object $o$, the cross-modal semantic similarity between $q$ and $o$ is defined as the cosine of the angle between the TDCMR hash code of the query and that of the spatial object, which can be expressed as:

$$f_c(q, o) = \frac{q.v_T \cdot o.v_I}{\|q.v_T\| \, \|o.v_I\|}$$

where $v_T$ represents a text-modality TDCMR hash code, $v_I$ represents an image-modality TDCMR hash code, and $\|q.v_T\|$ and $\|o.v_I\|$ are the norms of the TDCMR hash codes of the query object and the spatial object.
Definition 3 (Geo-multimedia data similarity). Given a geo-multimedia data query $q = (q.loc, q.m)$ and a geo-multimedia data object $o$, the similarity between $q$ and $o$ is defined as the weighted sum of the spatial similarity and the cross-modal semantic similarity, which can be expressed as:

$$F_d(q, o) = \lambda f_s(q, o) + (1 - \lambda) f_c(q, o)$$

where $f_s(q, o)$ and $f_c(q, o)$ are the spatial similarity and the cross-modal semantic similarity, and the parameter $\lambda \in [0, 1]$ is the weight factor used to balance them. The geo-multimedia data similarity score is thus a weighted score of the similarity between the query object and a dataset object in both spatial and semantic terms, which better matches query processing in practical application scenarios.
Definition 4 (Spatial multimedia k nearest neighbor query). Given a spatial multimedia query $q = (q.loc, q.m)$ and a set of spatial objects $O$, the query returns a subset $R \subseteq O$ of $k$ spatial objects such that $\forall o \in R$ and $\forall o' \in (O - R)$, $F_d(q, o) \geq F_d(q, o')$. When a spatial text object queries spatial image objects, the query is $q_{t2i} = (q.loc, q.m_T)$ and the result set $R_{t2i}$ contains $k$ spatial image objects, which can be described as:

$$R_{t2i} \subseteq O, \quad |R_{t2i}| = k, \quad \forall o \in R_{t2i},\ \forall o' \in (O - R_{t2i}):\ F_d(q_{t2i}, o) \geq F_d(q_{t2i}, o')$$

When a spatial image object queries spatial text objects, the query is $q_{i2t} = (q.loc, q.m_I)$ and the result set $R_{i2t}$ contains $k$ spatial text objects, which can be described as:

$$R_{i2t} \subseteq O, \quad |R_{i2t}| = k, \quad \forall o \in R_{i2t},\ \forall o' \in (O - R_{i2t}):\ F_d(q_{i2t}, o) \geq F_d(q_{i2t}, o')$$

Table 2 gives an example of a GMkNN query in which a spatial text object queries spatial image objects. A user visiting a city can only visit selected scenic spots because of a tight itinerary. The user enters a text describing the sites of interest, and the search engine returns images, videos, and other content of nearby sites that match this interest. The user can then browse the images or videos of a scenic spot to decide whether to visit it. Suppose there are six scenic spots, $o_1, o_2, o_3, o_4, o_5, o_6$, in the city. For a given spatial text query $q_{t2i} = (q.loc, q.m_T)$, set $\lambda$ to 0.5 and compute the spatial similarity, the cross-modal semantic similarity, and the geo-multimedia data similarity score between the query object $q_{t2i}$ and each of the spatial objects $o_1, \ldots, o_6$. With $k = 1$, the GMkNN result of query $q_{t2i}$ is $\{o_2\}$, and the image content of object $o_2$ is returned to the user.
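To make the scoring in Definitions 1 to 4 concrete, the following brute-force sketch ranks a handful of objects by a weighted score. It assumes that the spatial similarity is taken as one minus the distance normalized by the maximum distance (so that larger means closer) and that the weight is applied as λ to the spatial part and 1 − λ to the semantic part; the function names, the dictionary layout, and the toy data are hypothetical.

```python
import math


def spatial_similarity(q_loc, o_loc, d_max):
    # Assumption: one minus the Euclidean distance normalized by the maximum distance.
    return 1.0 - math.dist(q_loc, o_loc) / d_max


def semantic_similarity(q_code, o_code):
    # Cosine of the angle between two hash codes.
    dot = sum(a * b for a, b in zip(q_code, o_code))
    norm_q = math.sqrt(sum(a * a for a in q_code))
    norm_o = math.sqrt(sum(b * b for b in o_code))
    return dot / (norm_q * norm_o)


def gm_knn(q_loc, q_code, objects, k, lam=0.5):
    """Return the ids of the k objects with the highest weighted score."""
    # Fall back to 1.0 if every object sits exactly at the query location.
    d_max = max(math.dist(q_loc, o["loc"]) for o in objects) or 1.0
    scored = []
    for o in objects:
        score = (lam * spatial_similarity(q_loc, o["loc"], d_max)
                 + (1.0 - lam) * semantic_similarity(q_code, o["code"]))
        scored.append((score, o["id"]))
    scored.sort(reverse=True)
    return [oid for _, oid in scored[:k]]


objects = [
    {"id": "o1", "loc": (3.0, 4.0), "code": [1, -1, 1, 1]},
    {"id": "o2", "loc": (0.0, 1.0), "code": [1, -1, 1, -1]},
    {"id": "o3", "loc": (6.0, 8.0), "code": [-1, 1, -1, -1]},
]
top = gm_knn((0.0, 0.0), [1, -1, 1, 1], objects, k=1)
```

A linear scan like this is only practical for small object sets; the purpose of a hybrid index is to avoid scoring every object.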

Siamese Network and Triple Network
Siamese Network and Triple Network are multi-branch convolutional neural network models. Compared with a single convolutional neural network, they are composed of two or more convolutional neural networks that have the same structure and share parameters. Both are discussed below.

Siamese Network
Siamese Network consists of two convolutional neural networks that have the same structure and share parameters, and it aims to map two inputs into a measurable space for similarity comparison via a common function. Its objective is to minimize the distance between samples of the same category and maximize the distance between samples of different categories. Specifically, Siamese Network searches for a set of parameters $u$ such that the distance metric is small when $T_1$ and $T_2$ belong to the same category and large when $T_1$ and $T_2$ belong to different categories. From this, a pairwise constraint loss function is designed, which is shown in Formula (3).
In the formula, $D(T_1, T_2)$ is defined as the squared Euclidean distance between the Siamese Network outputs, as shown in Formula (4):

$$D(T_1, T_2) = \| f(T_1) - f(T_2) \|_2^2 \quad (4)$$

where $f(T)$ is the output of the network, and $I$ represents the category relationship between samples $T_1$ and $T_2$: $I = 1$ when $T_1$ and $T_2$ belong to the same category, and $I = 0$ when they belong to different categories. $m$ is a margin that controls the magnitude of the loss function.
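As a rough illustration of a pairwise constraint loss built on this squared distance, the sketch below uses one common contrastive form. The exact expression, including the factor 1/2 and the way the margin enters, is an assumption and not necessarily identical to Formula (3); the function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two network outputs."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pairwise_loss(f1, f2, same_category, m=1.0):
    """Contrastive-style loss (assumed form).

    Same-category pairs are penalized by their distance; different-category
    pairs are penalized only while their distance is below the margin m.
    """
    d = squared_distance(f1, f2)
    if same_category:
        return 0.5 * d
    return 0.5 * max(0.0, m - d)


loss_same = pairwise_loss([0.0, 0.0], [0.0, 0.0], True)
loss_far = pairwise_loss([0.0, 0.0], [3.0, 4.0], False)
loss_near = pairwise_loss([0.0, 0.0], [0.5, 0.0], False)
```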

Triple Network
The Triple Network model is developed from Siamese Network and consists of three convolutional neural networks with the same structure and shared parameters. The Triple Network's input is a triplet that contains an anchor sample, a positive sample, and a negative sample. Generally, the anchor sample and the positive sample belong to the same category or have related content, while the anchor sample and the negative sample do not. The triplet describes the relationship among the three samples, with which the network is trained. After distance metric optimization, the anchor sample is close to the positive sample and far from the negative sample. Specifically, $N$ triplets $(T_a, T_p, T_n)$ are defined, each consisting of an anchor sample, a positive sample, and a negative sample. The feature representation of each sample is obtained by the convolutional neural network. According to the triplet constraint on the distances between sample features, the triplet loss function is defined as in Formula (5):

$$L = \sum_{i=1}^{N} \max\left(0,\ D(T_a^{(i)}, T_p^{(i)}) - D(T_a^{(i)}, T_n^{(i)}) + \alpha\right) \quad (5)$$

In this formula, $\alpha$ is the margin between $D(T_a, T_p)$ and $D(T_a, T_n)$, and, as in Formula (4), $D(T_a, T_p)$ is defined as the squared Euclidean distance between Triple Network outputs.
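The role of the margin α can be illustrated with a small numeric sketch. It assumes the standard hinge form of the triplet loss over squared Euclidean distances, which matches the description of α as the interval between the two distances; the function names and toy vectors are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha):
    """Sum of hinge terms max(0, D(a, p) - D(a, n) + alpha) over all triplets."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(0.0, squared_distance(a, p) - squared_distance(a, n) + alpha)
    return total


anchor, positive = [0.0, 0.0], [1.0, 0.0]
# Negative far from the anchor: the constraint holds and the loss vanishes.
loss_satisfied = triplet_loss([anchor], [positive], [[3.0, 0.0]], alpha=1.0)
# Negative close to the anchor: the margin is violated and the loss is positive.
loss_violated = triplet_loss([anchor], [positive], [[1.0, 1.0]], alpha=2.0)
```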

Overview of the Framework
To address the low representation ability and slow query speed in geo-multimedia data representation and query, this paper aims to narrow the cognitive gap between humans and computers in the semantic understanding of multimedia data through a deep neural network. We construct a triplet-based deep cross-modal hashing network model (Triplet-based Deep Cross-Modal Retrieval, TDCMR) and use the trained model to encode geo-multimedia data semantically. Then, the TDCMR Hashing Quadtree (TH-Quadtree) geo-semantic hybrid index and its query algorithm are used to search the top-k nearest and semantically related geo-multimedia data objects quickly and accurately in a massive geo-multimedia database.

Quantitative Coding of Geo-Multimedia Data Schemes
Geo-multimedia data is polymorphic, heterogeneous, and semantically interconnected. Traditional cross-modal hashing algorithms use low-level hand-crafted features, and the resulting semantic gap leads to cross-modal hash codes with low semantic representation ability. To better alleviate the semantic gap, this paper proposes a triplet-based deep cross-modal hashing algorithm, TDCMR, which integrates feature learning and hash learning through a triplet deep neural network and introduces an improved triplet distance constraint. The aim is to improve the semantic representation of cross-modal hash codes by forcing heterogeneous data of the same category close to each other in Hamming space, which yields an effective semantic quantization coding. Figure 2 illustrates the proposed TDCMR framework. The whole framework is composed of a deep feature extraction module and a hash code learning module, unified into an end-to-end framework. The deep feature extraction module consists of a multi-layer perceptron that extracts text features and a deep convolutional neural network that extracts image features; it extracts deep features of samples to alleviate the semantic gap of heterogeneous data. The hash code learning module constructs, through one distance learning process, the semantic association of the anchor sample with the positive and negative samples and, through a second distance learning process, the semantic association of the positive sample with the anchor and negative samples. Both distance learning processes are completed from a single sample input, so that heterogeneous data of the same category is forced together in Hamming space.

Deep Feature Extraction
In this subsection, we describe the deep feature extraction module, which consists of two deep neural networks: the deep convolutional neural network CNN-F extracts features of the image modality, and a multi-layer perceptron extracts features of the text modality.

CNN-F
The deep convolutional neural network CNN-F [46] consists of 5 convolutional layers and 3 fully connected layers. The last layer, fc8, is a fully connected layer whose number of nodes equals the hash code length c, which facilitates mapping the deep image features extracted by the network into a hash code representation. The first convolutional layer, conv1, uses a convolution with stride 4, and the second to fifth convolutional layers, conv2 to conv5, all use a convolution with stride 1. In addition, max pooling in conv1, conv2, and conv5 effectively reduces the number of model parameters and helps prevent over-fitting. Similarly, Dropout regularization in the fully connected layers fc6 and fc7 helps prevent over-fitting.

MLP
The multi-layer perceptron (MLP) consists of 3 fully connected layers. The number of nodes in layer 1 equals the dimension of the bag-of-words vector of the input text. The number of nodes in layer 2 is set to 4096, and the number of nodes in the last layer is set to the hash code length c, which facilitates mapping the deep text features extracted by the MLP into a hash code representation.
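The mapping from a bag-of-words vector to a length-c code can be sketched with a tiny forward pass. This is only a shape-level illustration with random, untrained weights: the ReLU activation, the weight initialization, the reduced hidden size (8 instead of 4096), and all function names are assumptions made for the example.

```python
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def text_hash_code(bow, hidden, output):
    """Bag-of-words input -> hidden layer -> c outputs -> sign."""
    h = [max(0.0, t) for t in dense(bow, *hidden)]  # ReLU is an assumption
    out = dense(h, *output)
    return [1 if t >= 0 else -1 for t in out]


rng = random.Random(0)
vocab_size, hidden_size, c = 6, 8, 4  # toy sizes; the text sets the hidden layer to 4096
hidden = init_layer(hidden_size, vocab_size, rng)
output = init_layer(c, hidden_size, rng)
code = text_hash_code([1, 0, 0, 2, 0, 1], hidden, output)
```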

The Baseline for Triplet-Based Deep Cross-Modal Hashing
Given a triplet $(T_{a_i}, I_{p_i}, I_{n_i})$, the goal of triplet distance learning is to reduce the distance between the anchor sample $T_{a_i}$ and the positive sample $I_{p_i}$ and to enlarge the distance between the anchor sample $T_{a_i}$ and the negative sample $I_{n_i}$. The distance constraint of a triplet can therefore be defined as:

$$D(T_{a_i}, I_{p_i}) + \alpha < D(T_{a_i}, I_{n_i}) \quad (6)$$

where $\alpha$ is the margin between the distance $D(T_{a_i}, I_{p_i})$ from the anchor sample to the positive sample and the distance $D(T_{a_i}, I_{n_i})$ from the anchor sample to the negative sample.
The triplet loss function in Formula (5) can be constructed from the triplet distance constraint in Formula (6). During training, we find that when the margin $\alpha$ is small, the distance $D(T_{a_i}, I_{p_i})$ between the anchor sample $T_{a_i}$ and the positive sample $I_{p_i}$ is close to the distance $D(T_{a_i}, I_{n_i})$. Although the loss function converges quickly toward 0, similar text and image samples are difficult to distinguish. When $\alpha$ is large, $D(T_{a_i}, I_{p_i})$ is much smaller than $D(T_{a_i}, I_{n_i})$; similar text and image samples are easy to distinguish, but the network model is difficult to converge.
Based on the triplet distance constraint, an improved triplet distance constraint is proposed. Its learning goal is twofold. First, it constructs the similarity relationship of the anchor sample with the positive and negative samples, so that the distance between the anchor sample and the positive sample is less than that between the anchor sample and the negative sample. Second, it constructs the similarity relationship of the positive sample with the anchor and negative samples, so that the distance between the positive sample and the anchor sample is less than that between the positive sample and the negative sample. By learning from these two sets of distance relationships, heterogeneous data of the same category is forced close together, which improves the learning ability of the network model and ensures that the intra-class distance is less than the inter-class distance. The improved triplet distance constraint is formally expressed in Formula (7):

$$D(T_{a_i}, I_{p_i}) + \alpha < D(T_{a_i}, I_{n_i}), \qquad D(T_{p_i}, I_{p_i}) + \beta < D(T_{p_i}, I_{n_i}) \quad (7)$$

where the margin $\alpha$ between $D(T_{a_i}, I_{p_i})$ and $D(T_{a_i}, I_{n_i})$ and the margin $\beta$ between $D(T_{p_i}, I_{p_i})$ and $D(T_{p_i}, I_{n_i})$ are user-defined parameters. They act as balance parameters that control the distance relationships among the anchor, positive, and negative samples.
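A small check of the two-part constraint may help. The sketch below evaluates both inequalities for one triplet, using squared Euclidean distance and the pairs named in the text, D(T_p, I_p) and D(T_p, I_n), for the second part; the function names and toy vectors are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def satisfies_improved_constraint(t_a, t_p, i_p, i_n, alpha, beta):
    """True when both margin inequalities hold for one triplet.

    t_a, t_p: text features of the anchor and positive items.
    i_p, i_n: image features of the positive and negative items.
    """
    anchor_ok = squared_distance(t_a, i_p) + alpha < squared_distance(t_a, i_n)
    positive_ok = squared_distance(t_p, i_p) + beta < squared_distance(t_p, i_n)
    return anchor_ok and positive_ok


t_a, t_p, i_p = [0.0, 0.0], [0.1, 0.0], [0.2, 0.0]
ok_far = satisfies_improved_constraint(t_a, t_p, i_p, [2.0, 0.0], alpha=1.0, beta=1.0)
ok_near = satisfies_improved_constraint(t_a, t_p, i_p, [0.5, 0.0], alpha=1.0, beta=1.0)
```

Here the distant negative satisfies both inequalities, while the nearby negative violates them and would contribute to the loss.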
In cross-modal hash learning, the semantic relationship among the samples of a triplet is described by a triplet likelihood function. Assuming that the anchor sample is of the text modality and the positive and negative samples are of the image modality, the improved triplet likelihood function $p((a_j, p_j, n_j) \mid B_T, B_I)$ is derived from the improved triplet distance constraint in Formula (7), as shown in Formula (8). Here, $B_{T*j}$ and $B_{I*j}$ are the feature outputs of the text and image modalities, respectively, with $B_{T*j} \in \{-1, 1\}^c$ and $B_{I*j} \in \{-1, 1\}^c$; $u_T$ and $u_I$ are the parameters of the text and image feature extraction networks; and $\sigma(\cdot)$ denotes the sigmoid function used to compute the probability. $\alpha$ is the margin between the anchor-positive feature distance and the anchor-negative feature distance, and $\beta$ is the margin between the positive-anchor distance and the positive-negative distance.
Based on the improved triplet likelihood function, the heterogeneous association between different modal data is established by its negative logarithmic likelihood loss. The triplet loss function L t2i from text mode to image mode is shown in Formula (10).
Similarly, the triplet loss function from image mode to text mode is shown in Formula (11).
Therefore, according to Formulas (10) and (11), the complete loss function of the triplet-based deep cross-modal hashing algorithm is shown in Formula (12), where $B_T$ represents the feature matrix of the learned text data and $B_I$ represents the feature matrix of the learned image data, both of which contain the relative semantic relationships in the triplet labels. $V_T$ and $V_I$ represent the hash code matrices of the text and image data, respectively; the feature vectors pass the semantic relationships to the corresponding hash codes through $V_T = \mathrm{sign}(B_T)$ and $V_I = \mathrm{sign}(B_I)$. Jiang et al. [43] confirmed through a large number of experiments that better network performance can be obtained by assuming that the text hash code is the same as the image hash code during training. Therefore, the constraint $V = V_T = V_I$ is added to the objective in Formula (12), and the final triplet loss function is shown in Formula (13):

$$\min_{V, u_T, u_I} L = L_{t2i} + L_{i2t} + \gamma\left(\|V - B_T\|_F^2 + \|V - B_I\|_F^2\right) + \eta\left(\|B_T \mathbf{1}\|_F^2 + \|B_I \mathbf{1}\|_F^2\right), \quad \text{s.t. } V \in \{-1, +1\}^{c \times n} \quad (13)$$

By optimizing the loss function in Formula (13), the triplet network learns the deep neural network parameters and the hash code representation at the same time, realizing end-to-end learning. The first and second terms of the loss function are the improved triplet negative log-likelihood losses; optimizing them preserves the cross-modal similarity of the data in the original semantic space. The third term, $\gamma(\|V - B_T\|_F^2 + \|V - B_I\|_F^2)$, is a regularization term. Optimizing it reduces the quantization error, so that the cross-modal hash code better retains the semantic similarity in the data features. The fourth term, $\eta(\|B_T \mathbf{1}\|_F^2 + \|B_I \mathbf{1}\|_F^2)$, is also a regularization term. Optimizing it keeps the hash code values balanced, i.e., the numbers of +1 and −1 elements in the same bit position across the hash codes are the same, so that the information contained in each hash bit is maximized.
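The two regularization terms can be evaluated on toy matrices to see what they measure. The sketch reads the norm as a squared Frobenius norm and treats each matrix as c rows (bits) by n columns (samples); these conventions, the function names, and the numbers are assumptions made for illustration.

```python
def frobenius_sq(matrix):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(x * x for row in matrix for x in row)


def subtract(a, b):
    """Element-wise difference of two matrices of equal shape."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def quantization_term(v, b_t, b_i, gamma):
    """gamma * (||V - B_T||^2 + ||V - B_I||^2): gap between features and codes."""
    return gamma * (frobenius_sq(subtract(v, b_t)) + frobenius_sq(subtract(v, b_i)))


def balance_term(b_t, b_i, eta):
    """eta * (||B_T 1||^2 + ||B_I 1||^2): squared row sums, zero when bits are balanced."""
    row_sums_sq = lambda b: sum(sum(row) ** 2 for row in b)
    return eta * (row_sums_sq(b_t) + row_sums_sq(b_i))


b_t = [[1.0, -1.0, 0.5, -0.5], [0.5, 0.5, -0.5, -0.5]]  # c = 2 bits, n = 4 samples
b_i = [row[:] for row in b_t]
v = [[1, -1, 1, -1], [1, 1, -1, -1]]

quant = quantization_term(v, b_t, b_i, gamma=1.0)
balanced = balance_term(b_t, b_i, eta=0.5)
unbalanced = balance_term([[1.0] * 4, [1.0] * 4], [[1.0] * 4, [1.0] * 4], eta=0.5)
```

With features whose rows sum to zero the balance term vanishes, whereas all-positive features are penalized heavily.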
Driven by large image-text datasets, the optimized text feature extraction network parameters $u_T$, image feature extraction network parameters $u_I$, and hash code matrix $V$ are obtained using stochastic gradient descent with an alternating iteration strategy, which yields the trained TDCMR network model. Then, the network branch is selected according to the modality of the input data, and the deep features of the input are extracted to obtain its cross-modal hash code.
Specifically, optimizing the triplet loss function in Formula (13) is a non-convex problem. Therefore, stochastic gradient descent is combined with an alternating optimization strategy to learn the parameters $u_T$, $u_I$, and the hash code matrix $V$: when one of the three is updated, the other two are fixed, and the optimization alternates in this way until the model converges or the maximum number of iterations is reached.

Update u T
We learn $u_T$ with $u_I$ and $V$ fixed. In each iteration, a mini-batch is randomly sampled from the training dataset and fed to the network, and the back-propagation algorithm is used to learn the parameters $u_T$ of the text feature extraction network. The gradient of the loss function with respect to the output $B_{T*i}$ of the $i$-th text data object is given in Formula (14). Then $\partial L / \partial u_T$ is computed by the chain rule to update the parameter $u_T$.

Update u I
We learn $u_I$ with $u_T$ and $V$ fixed. In each iteration, a mini-batch is randomly sampled from the training dataset and fed to the network, and the back-propagation algorithm is used to learn the parameters $u_I$ of the image feature extraction network. The gradient of the loss function with respect to the output $B_{I*i}$ of the $i$-th image data object is given in Formula (16). Then $\partial L / \partial u_I$ is computed by the chain rule to update the parameter $u_I$.

Update V
We have fixed parameters u T and u I as learning hash code matrix V. By the relation between trace and norm of matrix, to matrix N, ||N|| 2 B I = tr(NN ) = tr(N N). Thus, the loss function can be simplified as shown in Formula (18): Keep V ij and B ij the same sign: Given text modal data and image modal data, semantic quantization coding of text modal data and image modal data can be realized by TDCMR model. Meanwhile, semantic similarity of heterogeneous multimedia data can be measured by hamming distance. During the concrete process, the text modal data T generates semantic quantization coding v T , and the calculation process is shown in Formula (21): Image modal data I generate semantic quantization coding v I , as shown in Formula (22): v I = b I (I) = sign( f I (I; u I )).
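The sign-based update and the Hamming comparison of the resulting codes can be sketched in a few lines. The sketch assumes the closed-form update takes the element-wise sign of γ(B_T + B_I) and maps zero to +1; the function names and toy matrices are hypothetical.

```python
def sign(x):
    # Assumption: zero is mapped to +1 so every bit is in {-1, +1}.
    return 1 if x >= 0 else -1


def update_v(b_t, b_i, gamma=1.0):
    """Element-wise sign of gamma * (B_T + B_I)."""
    return [[sign(gamma * (t + i)) for t, i in zip(row_t, row_i)]
            for row_t, row_i in zip(b_t, b_i)]


def hamming(u, v):
    """Number of positions where two +/-1 codes differ."""
    return sum(a != b for a, b in zip(u, v))


b_t = [[0.9, -0.2, 0.4], [-0.7, 0.1, 0.3]]
b_i = [[0.5, -0.6, -0.9], [-0.1, 0.2, -0.8]]
v = update_v(b_t, b_i)

code_a, code_b = [1, -1, 1, 1], [1, 1, 1, -1]
dist = hamming(code_a, code_b)
# For +/-1 codes of length c, the Hamming distance equals (c - inner product) / 2.
inner = sum(a * b for a, b in zip(code_a, code_b))
```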
The optimization procedure of the proposed TDCMR is summarized in Algorithm 1.
Algorithm 1 Optimization procedure of the proposed TDCMR
1: Input: textual dataset T; image dataset I; tuple label Γ;
2: Output: text feature extraction network parameters u_T; image feature extraction network parameters u_I; hash code matrix V;
3: Initialize parameters u_T and u_I;
4: Set the number of samples taken in each iteration N_T = N_I = 128;
5: Set the maximum numbers of iterations t_T = n/N_T, t_I = n/N_I;
6: for p = 1 to N do
7:   for i = 1 to t_T do
8:     Randomly sample N_T text samples from T to build a batch;
9:     Take anchor samples from the batch to build a set of triplet samples;
10:    For each text sample T_i, calculate G_{*i} = f_T(T_i; u_T) by forward propagation;
11:    Calculate the gradient ∂L/∂u_T by Equations (3)-(10);
12:    Update parameters u_T by backward propagation;
13:  end for
14:  for j = 1 to t_I do
15:    Randomly sample N_I image samples from I to build a batch;
16:    Take anchor samples from the batch to build a set of triplet samples;
17:    For each image sample I_i, calculate B^I_{*i} = f_I(I_i; u_I) by forward propagation;
18:    Calculate the gradient ∂L/∂u_I by Equation (15);
19:    Update parameters u_I by backward propagation;
20:  end for
21:  Update V by Equation (18);
22: end for
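The alternating schedule of Algorithm 1 can be sketched on a toy problem. This is not the paper's model: linear maps `W_T` and `W_I` stand in for the two deep networks, and a simple quantization loss ||W_T X_T - V||^2 + ||W_I X_I - V||^2 stands in for the triplet loss of Formula (13). All names, the learning rate and the data are assumptions made only to show how each block is updated while the other two are held fixed.

```python
import numpy as np

def alternating_optimization(X_T, X_I, code_len=8, epochs=10, batch=32, lr=0.02, seed=0):
    """Toy alternating scheme: update W_T, then W_I, then V, others fixed.

    ASSUMPTION: quantization loss and linear maps replace the paper's
    triplet loss and deep networks; features are roughly unit scale.
    """
    rng = np.random.default_rng(seed)
    n = X_T.shape[1]                       # columns are samples
    W_T = 0.01 * rng.standard_normal((code_len, X_T.shape[0]))
    W_I = 0.01 * rng.standard_normal((code_len, X_I.shape[0]))
    V = np.where(rng.standard_normal((code_len, n)) >= 0, 1.0, -1.0)

    def loss():
        return float(np.sum((W_T @ X_T - V) ** 2) + np.sum((W_I @ X_I - V) ** 2))

    history = [loss()]
    steps = max(1, n // batch)
    for _ in range(epochs):
        # Text step, then image step: mini-batch SGD with V fixed.
        for W, X in ((W_T, X_T), (W_I, X_I)):
            for _ in range(steps):
                idx = rng.choice(n, size=min(batch, n), replace=False)
                residual = W @ X[:, idx] - V[:, idx]
                W -= lr * 2.0 * residual @ X[:, idx].T / len(idx)  # in-place
        # Code step: with both maps fixed, the sign of the summed outputs
        # minimizes the toy loss over V in {-1, +1}.
        V = np.where(W_T @ X_T + W_I @ X_I >= 0, 1.0, -1.0)
        history.append(loss())
    return W_T, W_I, V, history

demo_rng = np.random.default_rng(1)
X_T_demo = demo_rng.standard_normal((8, 128))
X_I_demo = demo_rng.standard_normal((8, 128))
W_T_demo, W_I_demo, V_demo, history = alternating_optimization(X_T_demo, X_I_demo)
print(history[0], history[-1])
```

The point of the sketch is the control flow: each inner loop touches only one parameter block, and the code matrix is refreshed once per outer iteration, mirroring steps 7 to 21 of Algorithm 1.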

Algorithm Analysis
Our analysis of the training effect of the TDCMR algorithm shows that the selection of triplet samples is the key to model training. Given a triplet sample (T_a, T_p, T_n) with anchor sample T_a, positive sample T_p, and negative sample T_n, it falls into one of the following categories:
• Simple triplets: triplets with a loss function value of 0. The distances satisfy D(T_a, T_n) > D(T_a, T_p) + margin, i.e., the distance between the anchor sample T_a and the positive sample T_p is smaller than the distance between T_a and the negative sample T_n by more than the margin, so the negative sample is easy to identify.
• Semi-difficult triplets: triplets with a loss function value close to 0. The distances satisfy D(T_a, T_p) < D(T_a, T_n) < D(T_a, T_p) + margin, i.e., the negative sample T_n is farther from the anchor sample T_a than the positive sample but still within the margin, so the negative sample is harder to identify than in a simple triplet.
• Difficult triplets: triplets with a loss function value greater than 0. The distances satisfy D(T_a, T_n) < D(T_a, T_p), i.e., the negative sample T_n is closer to the anchor sample T_a than the positive sample T_p, so the negative sample is difficult to identify.
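The three categories can be checked mechanically from the two distances and the margin. The sketch below assumes the standard hinge-style triplet loss max(0, D(T_a, T_p) - D(T_a, T_n) + margin) rather than the paper's improved loss, and the function names are hypothetical.

```python
def triplet_loss(d_ap, d_an, margin):
    """Standard hinge triplet loss (ASSUMPTION: not the paper's improved loss)."""
    return max(0.0, d_ap - d_an + margin)

def classify_triplet(d_ap, d_an, margin):
    """Label a triplet by comparing anchor-positive and anchor-negative distances."""
    if d_an > d_ap + margin:
        return "simple"            # loss is 0
    if d_an > d_ap:
        return "semi-difficult"    # loss is small but the margin is violated
    return "difficult"             # negative is at least as close as the positive

# Hypothetical Hamming distances (d_ap, d_an) with margin = 4.
for d_ap, d_an in [(2, 9), (2, 4), (5, 3)]:
    print(d_ap, d_an, classify_triplet(d_ap, d_an, 4), triplet_loss(d_ap, d_an, 4))
```

Boundary cases are a design choice here: a triplet with D(T_a, T_n) exactly equal to D(T_a, T_p) + margin has zero loss but is labelled semi-difficult, and equal distances are labelled difficult.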
The selection of triplets affects the training effect of the model. Simple triplets are easy to identify but provide no effective information for training the network. Difficult triplets are hard to identify, and training on difficult triplets alone easily makes the network diverge and seriously reduces training efficiency. As training proceeds, the number of simple and semi-difficult triplets becomes much larger than the number of difficult triplets, which makes it hard to keep optimizing the network in the later stage of training. Therefore, we adopt a two-stage strategy to select the triplet samples used for training. In the early stage, semi-difficult triplets are selected as training data so that the network converges. In the later stage, difficult triplets are selected as training data, and the network is fine-tuned to obtain the optimal parameters and improve training efficiency.

Figure 3 illustrates the proposed framework for the index problem. The quadtree and the semantic hash table are integrated vertically, space first and then semantics. The location information is first organized according to the structure of the quadtree. The spatial objects contained in each quadtree leaf node are then semantically quantized by the cross-modal hash algorithm, and a hash table is associated with the corresponding leaf node according to the cross-modal hash codes. The resulting geo-semantic hybrid index, TH-Quadtree, speeds up access to the spatial objects in the geo-multimedia dataset O.
As shown in Figure 3, the quadtree is a tree-type index structure that accelerates spatial distance computation and organizes spatial information efficiently. The cross-modal hash codes produced by TDCMR describe the semantic information of geo-multimedia data, and the hash table organizes this semantic information with low storage cost and search time. Coupling the quadtree and the semantic hash table in a space-first manner ensures that k nearest neighbor queries over geo-multimedia data can quickly retrieve the spatial objects that satisfy a given spatial restriction and the query semantics. The TH-Quadtree index thus combines the spatial and semantic information of geo-multimedia data objects. It is a two-layer hybrid index structure composed of two parts: the quadtree of the spatial layer and the hash table of the semantic layer.
Figure 3. The framework of the TDCMR-Quadtree-based index method, which includes two layers: the space layer and the semantic layer.

Space Layer
TH-Quadtree is a two-tier hybrid index structure that integrates the spatial and semantic layers vertically. In the spatial dimension, the spatial information of a spatial object is generally represented by two-dimensional latitude and longitude coordinates, which prune better than high-dimensional semantic information. Therefore, the spatial layer index is built from the spatial positions of the objects in the geo-multimedia dataset and serves as the first layer of the TH-Quadtree index structure. In this paper, a quadtree is used to index the spatial position of all spatial objects, since it organizes two-dimensional spatial information efficiently. First, all spatial objects are regarded as a point set in the geographical space. Each spatial object then belongs to a minimum bounding rectangle (MBR), i.e., a node of the quadtree, and all MBRs are organized into different levels of the tree according to their spatial distribution. In general, the geographical space is recursively divided into a hierarchical tree structure. All geo-multimedia data objects are stored in leaf nodes, while the root and intermediate nodes do not store spatial objects.

Semantic Layer
For each leaf node of the quadtree spatial layer, a hash table index is associated as the semantic layer, the second layer of the TH-Quadtree index structure, to facilitate pruning in the semantic dimension. The semantic quantization code, i.e., the cross-modal hash code, of every geo-multimedia data object in a leaf node is obtained with the TDCMR cross-modal hash algorithm. Then, the identifier uID of each geo-multimedia data object is stored in a hash bucket whose key is the c-bit binary code, which yields a hash table containing the semantic information of all spatial objects in the leaf node. Spatial objects located in the same hash bucket have high spatial similarity, because they lie in the same leaf node, and high semantic similarity, because they share the same semantic quantization code.
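The two-layer organization described above can be sketched as a small data structure. The class below is an illustrative simplification, not the paper's implementation: the names `THQuadtreeNode`, `capacity` and `max_depth`, the splitting rule (split a leaf when it holds more than `capacity` objects) and the use of bit strings as hash keys are all assumptions.

```python
from collections import defaultdict

class THQuadtreeNode:
    """Quadtree node; only leaves store objects and a code -> [uid] hash table."""

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.box = (x0, y0, x1, y1)
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.children = None                  # four quadrants once split
        self.objects = []                     # (uid, x, y, code) tuples
        self.hash_table = defaultdict(list)   # semantic layer of this leaf

    def _child_for(self, x, y):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

    def find_leaf(self, x, y):
        node = self
        while node.children is not None:
            node = node._child_for(x, y)
        return node

    def insert(self, uid, x, y, code):
        leaf = self.find_leaf(x, y)
        leaf.objects.append((uid, x, y, code))
        leaf.hash_table[code].append(uid)
        if len(leaf.objects) > leaf.capacity and leaf.depth < leaf.max_depth:
            leaf._split()

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            THQuadtreeNode(x0, y0, mx, my, *args),
            THQuadtreeNode(mx, y0, x1, my, *args),
            THQuadtreeNode(x0, my, mx, y1, *args),
            THQuadtreeNode(mx, my, x1, y1, *args),
        ]
        pending, self.objects = self.objects, []
        self.hash_table = defaultdict(list)
        for obj in pending:
            self.insert(*obj)   # re-route each object into the new leaves

    def leaves(self):
        if self.children is None:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()

# Hypothetical objects: (uID, x, y, hash code as a bit string).
root = THQuadtreeNode(0.0, 0.0, 1.0, 1.0, capacity=2)
for obj in [(1, 0.1, 0.1, "0101"), (2, 0.2, 0.2, "0101"), (3, 0.9, 0.9, "1100"),
            (4, 0.8, 0.1, "0101"), (5, 0.15, 0.12, "1111")]:
    root.insert(*obj)
print(dict(root.find_leaf(0.2, 0.2).hash_table))
```

After a split, an internal node keeps neither objects nor a hash table, which matches the statement that only leaf nodes store geo-multimedia objects. The `max_depth` guard is added only so that many objects at identical coordinates cannot cause unbounded splitting.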

TH-Quadtree-Based Nearest Neighbor Query Algorithm
The main idea of the TH-Quadtree-based nearest neighbor query algorithm for geo-multimedia data is as follows. Given a query object q, spatial objects are searched first in the spatial layer and then in the semantic layer. Starting from the root node of the index structure, the spatial layer of the TH-Quadtree is traversed in best-first order according to the spatial best-match similarity, so that the tree nodes closest to the spatial position of the query object q are obtained successively. The optimal spatial similarity is computed as shown in Formula (23), where f_s(q, N) stands for the spatial similarity between node N and the query object q, and the optimal spatial similarity f_sbm(q, N) is the lower bound of the spatial similarity score between the query object q and any geo-multimedia object o in node N.

Based on the optimal spatial similarity f_sbm(q, N), when query processing reaches a leaf node, the search switches from the spatial layer to the semantic layer. The candidate set semantically related to the query object is obtained quickly by hash lookup in the hash table associated with the leaf node. Then, for each spatial object in the candidate set, spatial similarity and semantic similarity are fused, and the result set R is updated with the best spatial objects according to the geo-multimedia similarity score F_GM(q, o). Throughout the search, the result set R dynamically maintains the best spatial objects traversed so far, and the k-th smallest geo-multimedia similarity score in the current result set serves as the upper bound of the result set. The search terminates when the next unvisited node satisfies the condition of Formula (24), and the current result set is returned as the optimal query result.
Here f_sbm(q, N) is the lower bound of the spatial distance between the query object q and all spatial objects in the subtree rooted at N. When this lower bound is larger than the distance upper bound of the current result set R, no unvisited spatial object can be better than the Top-k results in the current result set, so the search terminates.
Since the spatial distance between the query q and any spatial object o in node N is no smaller than the spatial distance from q to node N, the spatial similarity score between q and N is not higher than that between q and any spatial object o, i.e., f_s(q, o) ≥ f_s(q, N). Similarly, since node N is the top element of the priority queue L, the spatial similarity between q and N is the lower bound of the spatial similarity between q and all currently unvisited nodes. When f_sbm(q, N) > D_ub, the geo-multimedia similarity score F_GM(q, o) between q and every unvisited spatial object o is greater than the distance upper bound D_ub, i.e., λ · f_s(q, o) + (1 − λ) · f_c(q, o) > D_ub. Therefore, no unvisited spatial object can be closer to the query q than the current Top-k search results, the current result set R is the optimal solution, and the query process can terminate.
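The pruning argument can be written as a single chain of inequalities. The sketch below assumes, as the expression above suggests, that F_GM is the λ-weighted sum of the spatial term f_s and the semantic term f_c, that 0 ≤ λ ≤ 1, that f_c is non-negative, and that the termination test is λ · f_s(q, N) > D_ub; these assumptions are stated here for the derivation and are not restated by the text.

```latex
\begin{align*}
F_{GM}(q, o) &= \lambda \, f_s(q, o) + (1 - \lambda) \, f_c(q, o) \\
             &\ge \lambda \, f_s(q, o)
               && \text{since } f_c(q, o) \ge 0 \text{ and } \lambda \le 1 \\
             &\ge \lambda \, f_s(q, N)
               && \text{since } f_s(q, o) \ge f_s(q, N) \text{ and } \lambda \ge 0 \\
             &> D_{ub}
               && \text{when } \lambda \, f_s(q, N) > D_{ub}.
\end{align*}
```

Because N is the head of the priority queue, every other unvisited node has a spatial score at least as large as that of N, so the same chain applies to all objects that have not been accessed.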
Given a k nearest neighbor query q over geo-multimedia data, let D_ub be the distance upper bound of the result set R and let L be the priority queue sorted by spatial similarity score in ascending order. During query processing, for the top element N popped from the priority queue L, the query termination condition is λ · f_s(q, N) > D_ub.
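The query procedure with this termination condition can be sketched as a best-first search. The code is a simplified illustration under several assumptions: tree nodes are plain dictionaries with hypothetical keys `box`, `children` and `objects`; f_s is the Euclidean distance divided by an assumed maximum distance; f_c is the normalized Hamming distance between bit-string codes; and the leaf is scanned directly instead of probing a hash table.

```python
import heapq
import math

def min_dist(q, box):
    """Smallest Euclidean distance from point q to rectangle box."""
    x0, y0, x1, y1 = box
    dx = max(x0 - q[0], 0.0, q[0] - x1)
    dy = max(y0 - q[1], 0.0, q[1] - y1)
    return math.hypot(dx, dy)

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def knn_query(root, q_xy, q_code, k, lam=0.5, max_dist=math.sqrt(2.0)):
    """Best-first k-NN; scores are distances, so smaller is better."""
    queue = [(lam * min_dist(q_xy, root["box"]) / max_dist, 0, root)]
    counter = 1                    # tie-breaker so dicts are never compared
    result = []                    # max-heap of (-score, uid), size <= k
    while queue:
        bound, _, node = heapq.heappop(queue)
        if len(result) == k and bound > -result[0][0]:
            break                  # lam * f_s(q, N) > D_ub: stop searching
        if node["children"]:
            for child in node["children"]:
                child_bound = lam * min_dist(q_xy, child["box"]) / max_dist
                heapq.heappush(queue, (child_bound, counter, child))
                counter += 1
            continue
        for uid, x, y, code in node["objects"]:
            f_s = math.hypot(x - q_xy[0], y - q_xy[1]) / max_dist
            f_c = hamming(q_code, code) / len(q_code)
            score = lam * f_s + (1 - lam) * f_c
            if len(result) < k:
                heapq.heappush(result, (-score, uid))
            elif score < -result[0][0]:
                heapq.heapreplace(result, (-score, uid))
    return sorted((-neg, uid) for neg, uid in result)

def leaf(box, objects):
    return {"box": box, "children": [], "objects": objects}

# Hypothetical index with four leaf cells over the unit square.
tree = {"box": (0, 0, 1, 1), "objects": [], "children": [
    leaf((0, 0, 0.5, 0.5), [(1, 0.1, 0.1, "0000"), (2, 0.4, 0.4, "1111")]),
    leaf((0.5, 0, 1, 0.5), [(3, 0.9, 0.1, "0000")]),
    leaf((0, 0.5, 0.5, 1), []),
    leaf((0.5, 0.5, 1, 1), [(4, 0.9, 0.9, "0000")]),
]}
top2 = knn_query(tree, (0.1, 0.1), "0000", k=2)
print([uid for _, uid in top2])
```

The result set is kept as a max-heap so that its root is always the current D_ub. Object 2 is spatially close to the query but semantically far, so it is displaced by object 3 once the second cell is visited.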

Datasets
The performance of the proposed method TDCMR is evaluated on the MIRFlickr-25k and NUS-WIDE datasets, some samples of which are illustrated in Figure 4. They are briefly introduced as follows.
• MIRFlickr-25k [47]. This dataset consists of 25,000 image-text pairs obtained from the Flickr website, and each pair has an image and its corresponding text labels. The dataset contains 24 manually labeled category tags, and each pair is marked with one or more category tags.
• NUS-WIDE [48]. This dataset consists of 269,648 image-text pairs obtained from the Flickr website and contains 81 manually labeled category tags. Each data pair is also marked with one or more category tags.
The performance of the proposed index TH-Quadtree is evaluated on the real dataset FL and the synthetic dataset IN. They are briefly introduced as follows.
• FL. This dataset is collected from the image sharing website Flickr (http://www.Flickr.com/ (accessed on 1 July 2021)) and contains 1 million images with geographic location information, each with at least one user-annotated text tag.
• IN. This dataset is a classical image database in which each node of the hierarchy is represented by at least 500 images, and each concept is quality controlled and manually annotated. Spatial locations are mapped from the U.S. Board on Geographic Names website (http://geonames.usgs.gov (accessed on 1 July 2021)).

Workload
All the experiments are run on a PC with an Intel(R) i7-6800K CPU, 64 GB of memory, and an NVIDIA GeForce GTX 1080 Ti GPU, running the Ubuntu 16.04 LTS operating system.

Settings
To evaluate the performance of the TDCMR algorithm, ten cross-modal hashing algorithms are introduced for comparison in the experiments. Among these algorithms, CCA is a multivariate statistical analysis method; CVH [49], STMH [50], CMSSH [51], CMFH [52], SCM [53], and SePH [54] are shallow cross-modal hashing algorithms based on hand-crafted features; and DCMH [43], PRDH [40], and TDH [38] are deep cross-modal hashing algorithms that use the same network as the TDCMR algorithm. We use the mAP value and the PR curve to quantitatively evaluate algorithm performance. Table 3 summarizes the benefits and drawbacks of the compared algorithms.
In the experiments, for the MIRFlickr-25k dataset, 2000 samples are randomly extracted as the test set, and the remaining samples are used as the validation set. During training, 10,000 samples are randomly extracted from the validation set as training data. For the NUS-WIDE dataset, 1866 samples are randomly extracted as the test set, and the remaining samples are used as the validation set. Similarly, 10,000 samples are randomly chosen from the validation set as training data. The experiments refer to the parameter settings of the paper and set the distance parameters α and β to half the length of the hash code. We use the default values of the balance parameters, 128 samples per iteration, and a maximum of 500 iterations. A retrieved item is considered a correct neighbor if it has the same label as the query.

Table 3. Benefits and drawbacks of the compared algorithms.
Algorithm | Benefits | Drawbacks
TDH | A triplet network processes paired and unpaired data at the same time and learns feature representations for them. | It is difficult to adjust the training parameters due to the supervision of GAN.
TDCMR | The improved triplet loss function effectively alleviates the semantic gap; TH-Quadtree efficiently organizes the spatial and semantic information of spatial multimedia data. | It cannot be directly extended to network-based studies.

Performance Evaluation
In this section, the correctness of the proposed method is verified by the mAP value and the PR curve, and its efficiency is evaluated by the response time in the experiments.
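For reference, mAP under Hamming ranking can be computed as in the following sketch. It assumes {-1, +1} codes, multi-hot label vectors, and the common convention that a retrieved item is relevant when it shares at least one label with the query; the function name and the toy data are hypothetical, and the paper's exact evaluation script is not given in the text.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP over all queries, ranking the database by Hamming distance."""
    q = np.asarray(query_codes)
    d = np.asarray(db_codes)
    ql = np.asarray(query_labels)
    dl = np.asarray(db_labels)
    code_len = q.shape[1]
    ap_values = []
    for i in range(q.shape[0]):
        hamming = 0.5 * (code_len - d @ q[i])        # valid for {-1,+1} codes
        order = np.argsort(hamming, kind="stable")   # nearest first
        relevant = (dl[order] @ ql[i]) > 0           # shares at least one label
        if not relevant.any():
            continue                                 # query has no relevant item
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(relevant) + 1)
        ap_values.append(float(np.mean(hits[relevant] / ranks[relevant])))
    return float(np.mean(ap_values)) if ap_values else 0.0

# Toy example: one query, three database items.
example_map = mean_average_precision(
    query_codes=[[1, 1]],
    db_codes=[[1, 1], [1, -1], [-1, -1]],
    query_labels=[[1, 0]],
    db_labels=[[1, 0], [0, 1], [1, 0]],
)
print(example_map)
```

In the toy example the relevant items appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 = 5/6.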

Correctness Comparison
Tables 4 and 5 report the mAP values of each algorithm on the image-to-text and text-to-image retrieval tasks under different hash code lengths on the MIRFlickr-25k dataset. Tables 6 and 7 report the mAP values of each algorithm on the NUS-WIDE dataset. The best mAP values are displayed in bold font. The mAP values in Tables 4-7 show that the TDCMR algorithm outperforms the other ten cross-modal hashing algorithms. On the MIRFlickr-25k dataset, the TDCMR algorithm is slightly better than the PRDH and DCMH algorithms in both retrieval tasks. On the NUS-WIDE dataset, the TDCMR algorithm is also superior to the PRDH and DCMH algorithms in both retrieval tasks. The reason is that the PRDH and DCMH algorithms employ pair-wise similarity constraints on category labels for hash learning without considering the relative semantic relationship between samples, which loses some of the rich semantic information and limits retrieval accuracy. The TDCMR algorithm employs the relative semantic relationship among the three samples of a triplet to build more semantic associations, while the improved triplet loss function allows the hash code to retain more category information. To some extent, it overcomes the disadvantage of the pair-wise similarity constraint, enhances the representation ability of the hash code, and thus achieves higher retrieval accuracy.

On the MIRFlickr-25k and NUS-WIDE datasets, the retrieval performance of the deep-learning-based algorithms TDCMR, PRDH, and DCMH is significantly better than that of algorithms based on hand-crafted features, such as CCA, CMFH, SCM, and SePH. This is because deep-learning-based algorithms can extract deep salient features of the data, and the representation ability of deep features is better than that of hand-crafted features, which shows the superiority of deep neural networks in salient feature extraction.
On the MIRFlickr-25k and NUS-WIDE datasets, the retrieval performance of the algorithms improves to a certain extent as the length of the hash code increases. The reason is that a longer hash code contains richer semantic information, which improves retrieval accuracy. It should be noted that retrieval performance improves with the code length only within a certain range; an excessive code length leads to over-fitting and other problems, which reduces retrieval performance. On the MIRFlickr-25k and NUS-WIDE datasets, the performance of the text-to-image retrieval task is always better than that of the image-to-text retrieval task, probably because the hidden semantic information in images is more difficult to extract, so that the text feature extraction network can learn more information.

Figure 5a,b show the precision-recall curve of each algorithm when using 32-bit hash codes on the MIRFlickr-25k dataset, while Figure 6a,b show the precision-recall curve of each algorithm when using 32-bit hash codes on the NUS-WIDE dataset. By the definition of the precision-recall curve, the larger the area enclosed by the PR curve and the coordinate axes, the better the retrieval performance of the algorithm. The area for the TDCMR algorithm is larger than that of the other baseline methods, which is consistent with the mAP values.

In summary, we validate the cross-modal retrieval performance of TDCMR on the MIRFlickr-25k and NUS-WIDE datasets and evaluate the retrieval accuracy through the mAP value and the precision-recall curve. The experimental results show that the proposed deep cross-modal hash algorithm has the best performance in the cross-modal retrieval task. Therefore, TDCMR can better retain the semantic information of the original data and achieves state-of-the-art retrieval performance in semantic quantization coding of geo-multimedia data.
Figure 7a,b show the results of the experiment in which we investigate the effect of the number of results (k) by varying k from 5 to 25 on the datasets FL and IN. As expected, the response time of each index increases with k, since a larger value of k leads to a larger search region in query processing. Compared with the traditional spatial pruning technique Quadtree, our proposed technique, TH-Quadtree, takes advantage of the semantic layer hash tables to reduce unnecessary disk access.

Figure 8a,b show the results of the experiment in which we investigate the effect of the dataset size (n) by varying n from 50 K to 250 K. Similarly, when the dataset size increases, the response time of each index increases, since more quadtree cells are accessed. The performance of TH-Quadtree is significantly superior to that of the traditional quadtree.

Conclusions
To solve the problems of the semantic gap and low semantic representation ability in cross-modal hashing methods, this paper investigates a triplet-based deep cross-modal hashing algorithm, TDCMR. By integrating feature extraction and hash code learning into an end-to-end triplet deep neural network model, deep sample features containing rich semantic information are extracted to better narrow the semantic gap. At the same time, this paper optimizes the distance relation of triplet samples and proposes an improved triplet loss function, which forces heterogeneous data with the same semantics close to each other in Hamming space and improves the semantic representation ability of the cross-modal hash code. Theoretical analysis and experiments show that TDCMR cross-modal hash codes better preserve the semantic information of the original data and have advantages in the semantic quantization coding of geo-multimedia data.
To address the slow speed and low efficiency of nearest neighbor queries over spatial multimedia data, a new hybrid index, TH-Quadtree, is proposed based on the TDCMR cross-modal hash code and the quadtree. TH-Quadtree efficiently organizes the spatial and semantic information of spatial multimedia data. At the same time, a nearest neighbor query algorithm based on TH-Quadtree is proposed to support the GMkNN query, and the NE nearest neighbor expansion strategy is introduced to optimize the semantic layer search, so as to quickly and accurately find spatially nearest and semantically related spatial multimedia data objects in a massive spatial multimedia database. Theoretical analysis and nearest neighbor query experiments show that the quadtree-based nearest neighbor query algorithm for spatial multimedia data effectively improves the query speed.
Our future work includes the following directions.
(1) Semantic quantization coding based on a target attention mechanism. There has always been a semantic gap between data of different modalities that is hard to bridge. This paper extracts the features of the image and text modalities through a deep convolutional neural network and a multi-layer perceptron, which yields sample features with rich semantic information and narrows the semantic gap to a certain extent. In the future, we will try to improve and optimize the structure of the feature extraction network, learn the features of the target regions of images or text by introducing a target attention mechanism, and obtain data features with more salient information, so as to further narrow the semantic gap and improve the representation ability of semantic quantization coding.
(2) Nearest neighbor query of spatial multimedia data under road networks. At present, a large number of LBS applications are based on the Euclidean space, where the Euclidean distance is the straight-line distance between two points. Accordingly, this paper uses the Euclidean distance to measure the spatial similarity between spatial objects. The road network distance, in contrast, is the road length between two points in the actual road network, which reflects people's living environment more faithfully. Our work will therefore extend the nearest neighbor query of spatial multimedia data to the road network environment. G-tree [55] is a road network index structure based on the R-tree that preserves the spatial structure and supports fast processing of kNN queries in road networks. How to design an efficient road network spatial-semantic hybrid index structure and query algorithm based on G-tree therefore also merits further study.