A Tourist Attraction Recommendation Model Fusing Spatial, Temporal, and Visual Embeddings for Flickr-Geotagged Photos

Abstract: The rapid development of social media data, including geotagged photos, has benefited the research of tourism geography; additionally, tourists’ increasing demand for personalized travel has encouraged more researchers to pay attention to tourism recommendation models. However, few studies have comprehensively considered the content and contextual information that may influence the recommendation accuracy, especially tourist attractions’ visual content due to redundant and noisy geotagged photos; therefore, we propose a tourist attraction recommendation model for Flickr-geotagged photos which fuses spatial, temporal, and visual embeddings (STVE). After spatial clustering and extracting visual embeddings of tourist attractions’ representative images, the spatial and temporal embeddings are modeled with the Word2Vec negative sampling strategy, and the visual embeddings are fused with Matrix Factorization and Bayesian Personalized Ranking. The combination of these two parts comprises our proposed STVE model. The experimental results demonstrate that our STVE model outperforms other baseline models. We also analyzed the parameter sensitivity and component performance to prove the performance superiority of our model.


Introduction
With the advent of the "Web 3.0" era [1,2], the Internet users' role has transformed from mere information receivers to producers and interactors of information. A large amount of data containing geographical location has been spontaneously generated by users, including social media check-in data, geotagged photos, etc. These data have gradually augmented or replaced the role of geographic data collected in traditional ways in geography research, including tourism geography research. According to the World Travel & Tourism Council and the World Tourism Organization statistics, the tourism industry accounts for over ten percent of global GDP [3]. Furthermore, the trip volume increases year by year, showing that the tourism industry plays an increasingly important role in the global economy [4]. In addition to the increasing scale, the tourism mode is also gradually changing. Independent travel has become the mainstream mode [5], which created tourists' demand for personalized and intelligent travel.
New tourism demand has also promoted the transformation of the data sources and research goals in tourism geography. Specifically, applying geotagged photos to these studies reflects this trend. Geotagged photo data have the advantages of containing a large amount of tourism information and reflecting tourists' real preferences more directly [6,7]. Besides, many studies on tourist attraction recommendation systems have emerged, which aim to meet tourists' increasing demand for intelligent and personalized tourism and solve the problem of tourist information overload [8]. The recommendation methods are generally divided into content-based and collaborative filtering (CF) methods. The content-based method uses the attributes of the items that users prefer to recommend similar items to users [9]. Such a method is robust against the cold-start problem, in which the recommendation system can hardly make accurate recommendations when encountering new users or items [10]. Nevertheless, it relies heavily on structured and accurate features, and the accuracy of the recommendation result is comparatively low [11]. The CF-based method collects other users' feedback to filter or rate the recommended items [10]. It has the advantages of fast speed and high accuracy, and thus it is widely used in recommendation systems. However, it cannot handle the cold-start and data sparsity problems well. Both recommendation methods therefore have their disadvantages, leading to insufficient recommendation accuracy in some scenarios. Therefore, hybrid recommendation methods that fuse both methods' advantages have gradually become a trend [12,13]. Besides, embedding models from the machine learning field have gradually emerged and developed in the research of recommendation algorithms. Using such a simple and efficient method to fuse content and contextual information in tourist attraction recommendations means that they can complement each other and improve the recommendation accuracy.
New data sources and new methods have brought new opportunities to research tourist attraction recommendation methods, but they have also brought some challenges. For instance, how to select and represent the appropriate contextual and content information is a question worth considering, especially the visual information of tourist attractions, which is easily ignored and, to a certain extent, difficult to extract because of the noisy and redundant photos among geotagged photos. Therefore, we propose a tourist attraction recommendation model fusing spatial, temporal, and visual embeddings (STVE) for geotagged photos. We leverage Flickr-geotagged photos as the dataset to validate our model. The STVE model is built after some preprocessing steps, and it mainly consists of two parts: the embeddings of temporal and spatial constraint information and the embeddings of visual information. The embeddings of temporal and spatial constraint information are obtained by the negative sampling strategy of Word2Vec; then, we use matrix factorization and Bayesian Personalized Ranking, combined with the embeddings of the representative images, to obtain the interaction between users and visual embeddings. The gradient ascent method is used to train and update the parameters. The comparison with several other recommendation methods demonstrates that STVE has better results in recommendation quality and ranking indicators. The experiment also analyzes how the components and main parameters of STVE influence the recommendation results. The main contributions of our study are summarized below:

• Given the CF-based models' cold-start problems and the content-based models' low accuracy problems, we propose a hybrid recommendation model for tourist attractions that fuses spatial, temporal, and visual embeddings (STVE).
• We modify Skip-gram's objective function to model the sequential factors in STVE, which takes advantage of Skip-gram's ability to handle sequential data well and is more in line with the actual tourist attraction recommendation scenario.
• Given that noisy and redundant photos may exert a bad influence on the extraction of visual embeddings and the recommendation results, we propose a framework that can automatically remove the noisy and redundant photos and select representative images to extract visual embeddings of the tourist attractions for further use.
The remainder of the paper is organized as follows. Section 2 reviews the related work on tourist attraction recommendations for social media data. Section 3 introduces the preliminary and the overall framework of the study, including data acquisition, data preprocessing, and model building and training steps. Section 4 presents the performance compared with other methods, the parameter sensitivity analysis, and the component-wise study. Section 5 summarizes this paper and discusses further study.

Related Work
Tourist attraction recommendation can be regarded as a type of location recommendation research. Similar to recommendation methods in other fields, location recommendation methods for social media data comprise content-based and CF-based methods. Nevertheless, with the development of recommendation system techniques, an increasing number of methods are improved by combining both methods, incorporating context and content into CF, or fusing advanced machine learning methods. Such methods can no longer be classified as content-based or CF methods and are collectively known as hybrid methods. The selection of contextual and content information for these methods has become a nontrivial issue in location recommendation research.
Regarding contextual information in location recommendation methods, sequential information is one of the most commonly considered kinds. It is generally modeled based on the Markov model and its variations, which calculate the probability and make recommendations according to the transition matrix from one location to another [14][15][16]. In recent years, many researchers have leveraged embedding methods to model sequential information. For instance, Xie et al. learned the transition from one point of interest (POI) to another with Large-scale Information Network Embedding (LINE) [17] and generated the embedding of each POI to recommend the next POI [18]. Zhao et al. leveraged Skip-gram to model the POI visiting trajectory [19]. Another important kind of contextual information is geographical distance, as one of the typical characteristics of location recommendation is that it is constrained by geographical distance. There were two major ways to model geographical distance constraints in previous studies. One is to establish a simple inverse relationship between a user's preference and the geographical distance among locations, for instance, the power-law function [20,21], the Gaussian model [22,23], and other inverse functions [24]. The other is to set a cutoff distance and filter out those locations whose distance from the currently visited location is larger than the cutoff distance [15,19]. Apart from the sequential and geographical factors, other factors have also been considered in location recommendation research, including temporal factors [25,26], the category of the locations [27], etc. The studies above considered one or two factors in their recommendation models, but few have fully integrated the various factors that may affect the recommendation accuracy, not to mention the combination with content information.
The content information includes user characteristics [28][29][30], tags [31,32], and visual information. Visual information is considered relatively rarely because accurate visual information is difficult to extract and the visual content of user-generated photos is noisy. Some researchers leveraged Scale-Invariant Feature Transform (SIFT) or color histograms to extract visual information [33,34], but these hand-crafted features limit the accuracy of visual information extraction to a great extent. The rise of convolutional neural networks has greatly improved visual information representation, and they have been applied in recommendation methods with visual content [21,35]. However, the imbalance in the number of photos across tourist attractions and the noise and redundancy in photos still affect the representativeness of the visual information. The accuracy of recommendation methods based solely on visual content is relatively low, and the combination with other contextual information is still needed.

Preliminary and Framework
Before we introduce our dataset and methods, some terms need to be declared for better understanding:

Definition 1 (Geotagged photo).
A geotagged photo is a photo with location information taken by users, represented as $p$. Each photo contains the identification code $id$, the taken time $t$, the taken coordinate $g$, the user $u$, and the attached tag set $X$.

Definition 2 (Photo collection).
A photo collection is all the geotagged photos in the study area within a certain time, represented as $P = \{p_1, p_2, \ldots, p_{|P|}\}$.

Definition 3 (Semantic location).
A semantic location is a location with unique semantic extracted by spatial clustering, represented as l. In our study, the extracted semantic location is a tourist attraction.

Definition 4 (Visit).
A visit means a user's visit to a tourist attraction within a certain time and space, represented as $v = (l, u, t, P_{l,u,t})$. $P_{l,u,t}$ represents the photo collection that the user $u$ took when visiting the tourist attraction $l$ at time $t$.

Definition 5 (User visiting trajectory).
A user visiting trajectory is the trajectory that records all visits of the user in chronological order, represented as $T_{u_i} = (v_{i1}, v_{i2}, \ldots, v_{i|T_{u_i}|})$.

Figure 1 shows the overall framework of our study, including preprocessing steps and model building steps. Each step is illustrated in detail in the following sections.

Dataset and Study Area
We leverage Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) [36] as the experimental dataset because it can be easily downloaded from Amazon Web Services (AWS) and can provide an adequate amount of geotagged photo data. Furthermore, Menk et al. summarized that most previous studies related to tourism recommendation also used Flickr data [37], indicating its applicability in tourism research. The features of each photo we mainly use include the ID of each geotagged photo, user ID, capture time, longitude and latitude, user tags, and the images themselves.
We select the geotagged photos whose coordinates are bounded in the study area and taken within a certain time, and Tokyo is selected as the study area to evaluate our model. Tokyo is the capital city of Japan, which is also a famous tourist city. In 2018, the number of inbound tourists to Tokyo was approximately 14.24 million, and the expenditure of inbound tourists in Tokyo was about JPY 1.19 trillion [38]. Figure 2 shows the spatial distribution of Flickr photos in Tokyo. A total of 145,397 photos bounded in Tokyo and uploaded by 2750 users were used in the following experiment.

Data Preprocessing
Before the STVE model is built, some preprocessing steps are needed, including spatial clustering of tourist attractions, obtaining the visual embedding of each tourist attraction, and constructing user visiting trajectory.

Spatial Clustering of Tourist Attractions
As the location information is represented as the latitude and longitude in the raw Flickr dataset, it is indispensable to cluster geotagged photos and obtain tourist attractions. We followed the clustering method in our previous study, namely the clustering method considering the spatial and semantic distance, which has proven to be effective to cluster fine-grained tourist attractions in the dense area of photos [39]. Ninety-nine tourist attractions were obtained in Tokyo after clustering, and most of them are in Chuo Ku, Minato Ku, and Chiyoda Ku. Some are shown in Figure 3, including Tokyo Tower (Figure 3b

Visual Embedding Extraction
After clustering, we leveraged a pre-trained deep ranking model to obtain each tourist attraction's visual embedding representation. The deep ranking model is a convolution-based model aiming at image retrieval with fine-grained visual similarity [40]. Each photo is input into the deep ranking model to obtain a 2048-dimensional embedding. It should be noted that the number of photos differs across tourist attractions, and there are some photos whose visual content is unrelated to the tourist attraction (for instance, selfies). Therefore, simply averaging the embeddings of all photos is not a suitable embedding representation of a tourist attraction. To obtain a more accurate visual representation, we made two improvements. First, we filtered two kinds of noisy photos before the photos are input into the deep ranking model: the photos whose content is mainly occupied by people are detected and removed by a single-shot multibox detector (SSD) model [41], and the photos that mainly display objects are filtered by a Multilayer Perceptron pre-trained on the Caltech 101 dataset [42] and the Places2 dataset [43]. Second, after obtaining the embeddings of the remaining photos from the deep ranking model, we calculate the Euclidean distance of each embedding from all other embeddings and sort the embeddings in ascending order of this distance. If the distance between two embeddings is small, the visual content of the corresponding two photos is similar. Therefore, if an embedding's distance to all other embeddings is small, the visual content of this photo is comparatively typical and representative. For each tourist attraction, we calculated the average of the top $n$ embeddings with the smallest distance from other embeddings, and the result is further used as the visual embedding of this tourist attraction, represented as $\bar{e}_{l_j}$:

$$\bar{e}_{l_j} = \frac{1}{n} \sum_{k=1}^{n} e_{l_j k} \tag{1}$$

where $e_{l_j k}$ represents the $k$-th embedding in the top $n$ list of the $j$-th tourist attraction, and we set $n$ as 50 in this study.
The visual embedding of each tourist attraction was fused into the recommendation model.
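The selection of representative embeddings can be sketched as follows. This is a minimal NumPy sketch under stated assumptions rather than the implementation used in the study: the function name `representative_embedding` is hypothetical, and "distance from all other embeddings" is assumed to be scored by the mean Euclidean distance to the other photos of the same attraction.

```python
import numpy as np


def representative_embedding(embeddings: np.ndarray, n: int = 50) -> np.ndarray:
    """Average the n embeddings that lie closest, on average, to all others.

    embeddings: array of shape (m, d), one row per filtered photo of an attraction.
    """
    m = embeddings.shape[0]
    if m == 0:
        raise ValueError("need at least one embedding")
    # Pairwise squared Euclidean distances via the Gram matrix, shape (m, m).
    sq = (embeddings ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    dist = np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding
    # Assumption: score each photo by its mean distance to the other photos.
    mean_dist = dist.sum(axis=1) / max(m - 1, 1)
    # Ascending order: the most typical photos come first.
    top = np.argsort(mean_dist)[: min(n, m)]
    return embeddings[top].mean(axis=0)
```

With $n = 50$ as in this study, an attraction with fewer than 50 remaining photos would simply use all of them in this sketch; how such attractions are handled in practice is not stated here.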

User Visiting Trajectory Construction
The user visiting trajectories need to be constructed to serve as the training data of the STVE model. Unlike Foursquare or other social media check-in data, in which a user's check-in records can be connected in chronological order to form the user visiting trajectory, a user of geotagged photos may take more than one photo when visiting a tourist attraction within a short time (as shown by the three photos in $l_2$ in Figure 4). Another inevitable problem is that some photos cannot be clustered into any tourist attraction due to the nature of density-based clustering with noise. Therefore, we set a time threshold $\Delta t$ and a distance threshold $\Delta dis$ to judge whether photos taken at adjacent times should be merged into the same visit. We sort each user's photos in chronological order and loop through them, starting from the first photo. If the current photo and the next photo have been clustered into the same attraction and the interval between their shooting times is less than $\Delta t$, they are merged into the same visit. If at least one of the current photo and the next photo is not clustered, we judge whether the shooting time interval is less than $\Delta t$ and the distance between the two photos is less than $\Delta dis$. If both are true, they are merged into the same visit. After constructing the user visiting trajectories, we remove users who visited no more than four attractions, and the final number of trajectories (users) is 1,801. We select the first 80% of each trajectory as the training data and the remaining 20% as the final evaluation test data.
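The merging rule above can be illustrated with a short sketch. It is an illustration under assumptions, not the code used in the study: the `Photo` class, the function names, the use of the haversine distance, and the threshold values in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import List, Optional


@dataclass
class Photo:
    t: float                   # capture time in seconds
    lat: float
    lon: float
    attraction: Optional[int]  # cluster id, or None if the photo is unclustered


def haversine_m(a: Photo, b: Photo) -> float:
    """Great-circle distance in metres (assumed distance measure)."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(h))


def merge_visits(photos: List[Photo], dt: float, ddis: float) -> List[List[Photo]]:
    """Group one user's photos into visits using the time and distance thresholds."""
    visits: List[List[Photo]] = []
    for p in sorted(photos, key=lambda x: x.t):
        if visits:
            prev = visits[-1][-1]
            close_in_time = (p.t - prev.t) < dt
            if prev.attraction is not None and p.attraction is not None:
                # Both clustered: same attraction and short time interval.
                same = prev.attraction == p.attraction and close_in_time
            else:
                # At least one unclustered: short interval and short distance.
                same = close_in_time and haversine_m(prev, p) < ddis
            if same:
                visits[-1].append(p)
                continue
        visits.append([p])
    return visits
```

For example, `merge_visits(photos, dt=3600, ddis=200)` would merge photos taken less than one hour and 200 m apart; these two values are placeholders chosen only for illustration.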

Model Description and Optimization
In the following section, we describe our STVE model, including the spatial-temporal embedding part, the visual embedding part, and model optimization. The structure of STVE and the connection between preprocessing steps are shown in Figure 5, and some important notations in the STVE model are listed in Table 1.

Figure 5. The STVE's structure and its connection with the preprocessing steps.

Table 1. Important notations in the STVE model.

Notation                      Description
$T$                           the training dataset for all users in the study area
$u_i$, $l_j$                  a user $i$ and a tourist attraction $j$
$l_k$, $l_{k+1}$              the $k$-th and $(k+1)$-th tourist attractions visited by the user $i$
$t_{k+1}$                     the time slot in which the user $i$ visits his/her $(k+1)$-th attraction
$v_{l_k}$, $v_{l_{k+1}}$      the $f_1$-dimensional embedding representations of $l_k$ and $l_{k+1}$
$v_{t_{k+1}}$                 the $f_1$-dimensional embedding representation of $t_{k+1}$
$v_{u_i}$                     the $f_2$-dimensional embedding representation of user $i$
$\bar{e}_{l_j}$               the visual embedding of representative images for the attraction $j$
$k_{ne}$, $k'_{ne}$           the number of negative samples for spatial-temporal embeddings and visual embeddings
$l_{ne}$, $l'_{ne}$           the negative sample attractions for spatial-temporal embeddings and visual embeddings
$f_1$                         the number of dimensions for $v_{l_k}$, $v_{l_{k+1}}$, and $v_{t_{k+1}}$
$f_2$                         the number of dimensions for $v_{u_i}$
$f_e$                         the number of dimensions for visual embeddings
$W_{ul}$                      the $f_2 \times f_e$ embedding matrix

Spatial-Temporal Embedding
We first modeled the sequential characteristics of tourist visiting trajectories following the principle of Skip-gram, because Skip-gram, as one of the Word2Vec methods, handles sequential data such as sentences well. The objective function of Skip-gram is to maximize the probability of the contextual words given the center word, represented as:

$$\max \sum_{w_t \in C} \; \sum_{t-k \le i \le t+k,\ i \ne t} \log P(w_i \mid w_t) \tag{2}$$

where $C$ represents the whole training corpus, and $w_i$ represents the contextual word of $w_t$ within the window size $k$. Both $w_t$ and $w_i$ belong to the corpus $C$. We regard one user visiting trajectory as a sentence and each tourist attraction in the trajectory as a word.
With Skip-gram's objective function, we infer the contextual tourist attractions given the center attraction in the trajectory. However, in the scenario of natural language sentences, the contextual word selection of Word2Vec has no direction, while in the tourist attraction recommendation scenario, it is more in line with the actual situation to predict the next attraction given the currently visited attraction. Therefore, we modify the objective as Equation (3):

$$\max \sum_{T_{u_i} \in T} \; \sum_{k=1}^{|T_{u_i}|-1} \log P(l_{k+1} \mid l_k) \tag{3}$$

where $T$ represents the visiting trajectories of all users; $l_k$ and $l_{k+1}$ represent the $k$-th and $(k+1)$-th visited tourist attractions of the trajectory belonging to user $u_i$, respectively, and $k = 1, 2, \ldots, |T_{u_i}| - 1$. $P(l_{k+1} \mid l_k)$ represents the conditional probability of moving from $l_k$ to $l_{k+1}$. Similar to the target of Word2Vec sentence training, Equation (3) maximizes these conditional probabilities over the whole dataset $T$.
Apart from the influence of sequential characteristics, the time of the day may also influence users' selection of attractions to visit. The heat map in Figure 6 shows the users' visiting patterns of fifty randomly selected Tokyo attractions at different hours within one day. It can be seen that the visiting patterns for different tourist attractions are not the same. For instance, the visiting hours for the first two attractions (ID 0 and 1) in Figure 6 are concentrated between 10 a.m. and 4 p.m., while those of some tourist attractions (such as ID 43 and 44) are discretely distributed between 11 a.m. and 10 p.m. Therefore, the recommendation model should also consider the influence of the time of the day.
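The difference between the undirected Skip-gram context and the directed next-attraction context can be illustrated with a small sketch. The function names are hypothetical and attractions are represented by integer IDs; this only shows which training pairs each formulation would produce, under the assumption that a trajectory is an ordered list of attraction IDs.

```python
from typing import List, Tuple


def skipgram_pairs(seq: List[int], window: int) -> List[Tuple[int, int]]:
    """(center, context) pairs in both directions within the window."""
    pairs = []
    for t, center in enumerate(seq):
        lo, hi = max(0, t - window), min(len(seq), t + window + 1)
        pairs.extend((center, seq[i]) for i in range(lo, hi) if i != t)
    return pairs


def next_attraction_pairs(seq: List[int]) -> List[Tuple[int, int]]:
    """Directed (current attraction, next attraction) pairs only."""
    return list(zip(seq, seq[1:]))
```

For a trajectory visiting attractions 3, 7, and 5 in that order, the directed variant keeps only the transitions 3 to 7 and 7 to 5, whereas the undirected variant with a window of 1 also includes the reverse pairs.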
Suppose the target recommendation scenario is to infer the most likely next attraction given the previously visited attraction and the current time. The objective that adds the temporal factor to Equation (3) can then be formulated as:

$$\mathcal{L}_{ST} = \sum_{u_i} \sum_{k=1}^{|T_{u_i}|-1} \log P(l_{k+1} \mid l_k, t_{k+1}) \quad (4)$$

where $t_{k+1}$ represents the time that user $u_i$ visited the $(k+1)$-th tourist attraction. We map the time of the day into integer values from 0 to 23 to avoid the problems of too many time slots and data sparsity. For instance, if a user visits a tourist attraction between 8 a.m. and 9 a.m. (not including 9 a.m.), the visiting time will be mapped to 8. The SoftMax function is used to define the conditional probability $P(l_{k+1} \mid l_k, t_{k+1})$ and train the latent factors of $l_k$, $l_{k+1}$ and $t_{k+1}$ (denoted as $v_{l_k}$, $v_{l_{k+1}}$ and $v_{t_{k+1}}$, respectively). Two symbols $\hat{v}_{t_c}$ and $\hat{v}_n$ are introduced for a better description, and they are defined as follows:

$$\hat{v}_{t_c} = v_{l_k} \oplus v_{t_{k+1}}, \qquad \hat{v}_n = v_{l_{k+1}} \oplus v_{l_{k+1}}$$

where $\oplus$ represents the concatenation operator. The inner product of $\hat{v}_{t_c}$ and $\hat{v}_n$ can be denoted as follows:

$$\hat{v}_n \cdot \hat{v}_{t_c} = v_{l_k} \cdot v_{l_{k+1}} + v_{t_{k+1}} \cdot v_{l_{k+1}}$$

Then $P(l_{k+1} \mid l_k, t_{k+1})$ can be formulated as:

$$P(l_{k+1} \mid l_k, t_{k+1}) = \frac{\exp(\hat{v}_n \cdot \hat{v}_{t_c})}{\sum_{l \in L} \exp(\hat{v}_l \cdot \hat{v}_{t_c})} \quad (5)$$

where $L$ is the set of all tourist attractions and $\hat{v}_l = v_l \oplus v_l$. However, the cost of computing Equation (5) is impractically high because the SoftMax function sums over all attractions. Therefore, the negative sampling method is leveraged as a computationally efficient approximation in Equation (4), which can be transformed into Equation (6):

$$\mathcal{L}_{ST} = \sum_{u_i} \sum_{k=1}^{|T_{u_i}|-1} \Big( \log \sigma(\hat{v}_n \cdot \hat{v}_{t_c}) + \sum_{l_{ne} \in L_{ne}} \log \sigma(-\hat{v}_{ne} \cdot \hat{v}_{t_c}) \Big) \quad (6)$$

where $\sigma(x)$ is the Sigmoid function; $l_{ne}$ represents the negative sample attractions, with $\hat{v}_{ne} = v_{l_{ne}} \oplus v_{l_{ne}}$, and $k_{ne}$ is the number of negative samples in $L_{ne}$.

Due to the spatial distance constraint, tourists may prefer a tourist attraction closer to the one they are currently visiting. In other words, tourists are less likely to choose a tourist attraction far away from the current one. Therefore, we introduce this spatial distance constraint into the process of negative sampling, i.e., the negative samples are not randomly chosen but are chosen from those attractions whose distance from the currently visited attraction is larger than a predefined distance threshold. The set of negative samples can be formulated as Equation (7):

$$L^{g}_{ne} = \{\, l_{ne} \mid dis(l_k, l_{ne}) > \Delta dis \,\} \quad (7)$$

where $\Delta dis$ is a predefined distance threshold. Substituting $L_{ne}$ with $L^{g}_{ne}$ in Equation (6), the final representation of the spatial-temporal embedding can be written as:

$$\mathcal{L}_{ST} = \sum_{u_i} \sum_{k=1}^{|T_{u_i}|-1} \Big( \log \sigma(\hat{v}_n \cdot \hat{v}_{t_c}) + \sum_{l_{ne} \in L^{g}_{ne}} \log \sigma(-\hat{v}_{ne} \cdot \hat{v}_{t_c}) \Big) \quad (8)$$
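The pieces of the spatial-temporal part described above (hour-of-day slots, the concatenated inner product, and distance-constrained negative sampling) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy coordinates, the 5 km threshold, and the Earth-radius constant are all assumptions chosen for illustration.

```python
import math
import random
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for the Haversine formula


def hour_slot(visit_time):
    """Map a timestamp to an integer slot 0..23 (8:00 to 8:59 maps to 8)."""
    return visit_time.hour


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def far_negative_candidates(current, coords, delta_dis_km):
    """Attractions farther than the threshold from the current one (Eq. 7)."""
    return [l for l, c in coords.items()
            if l != current and haversine_km(coords[current], c) > delta_dis_km]


def sample_negatives(current, coords, delta_dis_km, k_ne, rng):
    """Draw up to k_ne negatives from the distance-constrained candidate set."""
    candidates = far_negative_candidates(current, coords, delta_dis_km)
    return rng.sample(candidates, min(k_ne, len(candidates)))


def st_score(v_lk, v_t, v_next):
    """Inner product of (v_lk concat v_t) with (v_next concat v_next)."""
    context = v_lk + v_t        # list concatenation plays the role of the ⊕ operator
    target = v_next + v_next
    return sum(c * t for c, t in zip(context, target))


# Hypothetical attraction IDs with made-up coordinates (lat, lon).
coords = {0: (35.0, 139.0), 1: (35.0, 139.01), 2: (35.1, 139.0)}
negatives = sample_negatives(0, coords, 5.0, 3, random.Random(0))
```

With these toy coordinates, attraction 1 lies under 1 km from attraction 0 and is excluded, so only attraction 2 is eligible as a negative sample. The score function also makes the identity in the text concrete: concatenating first and then taking one inner product gives the same value as summing the two separate inner products.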

Visual Embedding
As analyzed above, the visual factor is also one of the essential factors that affect tourists' choice of tourist attractions. Therefore, the recommendation model should incorporate visual information. Inspired by Visual Bayesian Personalized Ranking (VBPR) proposed by He et al. [44], we fuse tourist attractions' visual embeddings into matrix factorization and Bayesian Personalized Ranking. The matrix factorization of "users-tourist attractions" can be established as:

$$\hat{x}_{u_i l_j} = v_{u_i} \cdot v_{l_j} \quad (9)$$

where $v_{u_i}$ is the embedding of user $i$, and $v_{l_j}$ is the embedding of tourist attraction $j$. Both have $f_2$ dimensions. We represent $v_{l_j}$ as the product of a parameter matrix $W_{ul}$ and the visual embedding $e_{l_j}$ generated in Section 3.3:

$$v_{l_j} = W_{ul} \cdot e_{l_j} \quad (10)$$

where $W_{ul}$ is an $f_2 \times f_e$ parameter matrix, and $f_e$ is the number of dimensions of the visual embedding $e_{l_j}$ (2048, as mentioned above). Substituting $v_{l_j}$ with Equation (10) and further introducing the bias term $\beta$ in Equation (9), the representation of the visual embedding can be formulated as:

$$\hat{x}_{u_i l_j} = v_{u_i} \cdot (W_{ul} \cdot e_{l_j}) + \beta \cdot e_{l_j} \quad (11)$$

We optimize $\mathcal{L}_V$ with VBPR, which assumes that the user prefers the visited attraction over all other attractions. The negative samples $l_{ne}$ are selected randomly. Suppose the number of negative samples is $k_{ne}$; then $\mathcal{L}_V$ for user $u_i$ and visited attraction $l_j$ can be formulated as:

$$\mathcal{L}_V(u_i, l_j) = \sum_{l_{ne}} \ln \sigma(\hat{x}_{u_i l_j} - \hat{x}_{u_i l_{ne}}) \quad (12)$$

For the whole training dataset, the users' visual preference for the tourist attractions can be modeled as Equation (13):

$$\mathcal{L}_V = \sum_{u_i} \sum_{l_j \in T_{u_i}} \sum_{l_{ne}} \ln \sigma(\hat{x}_{u_i l_j} - \hat{x}_{u_i l_{ne}}) \quad (13)$$
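To make the visual scoring concrete, the following is a small pure-Python sketch of the score in Equation (11) and of one pairwise term of the VBPR-style objective. The helper names and the tiny 2 × 3 example are assumptions for illustration; the actual model uses 2048-dimensional visual embeddings and learned parameters.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, e):
    """Multiply an f2 x fe matrix (list of rows) by an fe-dimensional vector."""
    return [dot(row, e) for row in W]


def visual_score(v_u, W_ul, beta, e_l):
    """Preference score: v_u . (W_ul e_l) + beta . e_l."""
    return dot(v_u, matvec(W_ul, e_l)) + dot(beta, e_l)


def log_sigmoid(x):
    """Numerically stable ln(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_pair_term(v_u, W_ul, beta, e_pos, e_neg):
    """ln sigma(score(visited) - score(negative)) for one sampled pair."""
    diff = (visual_score(v_u, W_ul, beta, e_pos)
            - visual_score(v_u, W_ul, beta, e_neg))
    return log_sigmoid(diff)


# Toy example with f2 = 2 and fe = 3 (hypothetical values).
v_u = [1.0, 0.0]
W_ul = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
beta = [0.0, 0.0, 1.0]
e_pos, e_neg = [2.0, 0.0, 1.0], [1.0, 0.0, 0.0]
term = bpr_pair_term(v_u, W_ul, beta, e_pos, e_neg)
```

In this toy case the visited attraction scores 3 and the negative sample scores 1, so the pairwise term is ln σ(2). Maximizing the term pushes the score of the visited attraction above that of the sampled negative.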

Model Learning
We combine Equation (8) with Equation (13) by a linear weighted sum, and the objective function of the proposed STVE model, which fuses spatial, temporal, and visual information, is formulated as:

$$\theta^{*} = \arg\max_{\theta} \left( \alpha \cdot \mathcal{L}_{ST} + (1 - \alpha) \cdot \mathcal{L}_{V} \right) \quad (14)$$

where $\theta$ is the parameter set that maximizes the value of $\alpha \cdot \mathcal{L}_{ST} + (1 - \alpha) \cdot \mathcal{L}_{V}$.

The detail of the learning process of STVE is shown in Algorithm 1. The input of STVE training includes the training dataset $T$ and the parameter set $\theta$. We leverage minibatch gradient ascent to update the parameters, and we set the ratio parameter $b$ to 0.5. The number of training epochs is max_epoch, and $\eta_1$ and $\eta_2$ are the learning rates of $\mathcal{L}_{ST}$ and $\mathcal{L}_V$, respectively. $\Delta dis$ is the cutoff distance threshold. First, all the parameters are initialized with a normal distribution (Line 1), and the formula for updating the parameters is as follows:

$$\theta \leftarrow \theta + \eta \frac{\partial \mathcal{L}}{\partial \theta} \quad (15)$$

where $\eta$ is the learning rate. For the $\mathcal{L}_{ST}$ part, we update $v_{t_{k+1}}$, $v_{l_k}$ and $v_{l_{k+1}}$ for the first $(|T_{u_j}| - 1)$ visited attractions of each user (Lines 6 to 9), and select $k_{ne}$ negative samples from $L^{g}_{ne}$ to update $v_{t_{k+1}}$, $v_{l_k}$ and $v_{l_m}$ for each negative sample $l_m$ (Lines 10 to 14). Regarding the $\mathcal{L}_V$ part, we define $\hat{x}_{u_i l_k l_n}$ as:

$$\hat{x}_{u_i l_k l_n} = \hat{x}_{u_i l_k} - \hat{x}_{u_i l_n} = v_{u_i} \cdot W_{ul} \cdot (e_{l_k} - e_{l_n}) + \beta \cdot (e_{l_k} - e_{l_n}) \quad (16)$$

where $\hat{x}_{u_i l_k}$ and $\hat{x}_{u_i l_n}$ are defined by Equation (11). We update $v_{u_i}$, $\beta$ and $W_{ul}$ with VBPR (Lines 17 to 20):

$$\sum_{(u_i, l_j, l_k)} \ln \sigma(\hat{x}_{u_i l_j l_k}) - \lambda_{\theta} \|\theta\|^2 \quad (17)$$

where $\lambda_{\theta}$ is the regularization parameter. Here we set the regularization parameters of $v_{u_i}$, $\beta$ and $W_{ul}$ to the same value $\lambda$.
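As an illustration of the update rule, the sketch below performs one gradient-ascent step on the user embedding for a single (user, visited attraction, negative attraction) triple of the visual part, and shows the linear weighting of the two objectives. It is a simplified, assumed reading of the procedure: only the user embedding is updated, the gradient is derived by hand from the pairwise term with an L2 penalty, and all names and values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pair_difference(v_u, W_ul, beta, e_pos, e_neg):
    """Score difference: v_u . W_ul (e_pos - e_neg) + beta . (e_pos - e_neg)."""
    d = [p - n for p, n in zip(e_pos, e_neg)]
    Wd = [dot(row, d) for row in W_ul]
    return dot(v_u, Wd) + dot(beta, d), Wd


def ascent_step_user(v_u, W_ul, beta, e_pos, e_neg, eta, lam):
    """One step theta <- theta + eta * dL/dtheta for theta = v_u.

    The gradient of ln sigma(x) - lam * ||v_u||^2 with respect to v_u is
    (1 - sigma(x)) * W_ul (e_pos - e_neg) - 2 * lam * v_u.
    """
    x, Wd = pair_difference(v_u, W_ul, beta, e_pos, e_neg)
    g = 1.0 - sigmoid(x)
    return [vu + eta * (g * wd - 2.0 * lam * vu) for vu, wd in zip(v_u, Wd)]


def pair_objective(v_u, W_ul, beta, e_pos, e_neg, lam):
    """Regularized pairwise objective for one triple (user part only)."""
    x, _ = pair_difference(v_u, W_ul, beta, e_pos, e_neg)
    return math.log(sigmoid(x)) - lam * dot(v_u, v_u)


def combined_objective(alpha, l_st, l_v):
    """Linear weighted sum of the two parts."""
    return alpha * l_st + (1.0 - alpha) * l_v


# Toy values (hypothetical): identity projection, zero bias.
W_ul = [[1.0, 0.0], [0.0, 1.0]]
beta = [0.0, 0.0]
e_pos, e_neg = [1.0, 0.0], [0.0, 1.0]
v_before = [0.0, 0.0]
v_after = ascent_step_user(v_before, W_ul, beta, e_pos, e_neg, eta=0.1, lam=0.01)
```

Because this is ascent rather than descent, a sufficiently small step raises the objective: here the user embedding moves from the origin toward the direction that separates the visited attraction from the negative one.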

Evaluation Metrics
We leverage four metrics to evaluate our STVE model: precision, recall, mean reciprocal rank (MRR), and mean distance error (MDE). Precision@N is the proportion of the top-N recommended attractions that are ground-truth tourist attractions, and Recall@N is the ratio between the number of ground-truth attractions in the top-N recommended results and the number of tourist attractions that the user visited. These two common metrics of recommendation quality are formulated as Equations (18) and (19), respectively, where $R_N(u_i)$ is the set of top-N recommendation results, and $l_k$ represents the actual tourist attraction that the user visited.
MRR is a ranking metric, defined by Equation (20), where $rank_{l_k}$ is the rank of the ground-truth attraction in the recommended list. MDE calculates the average minimum geographical distance between the ground-truth tourist attraction and any of the top-N predicted attractions. It is not a general metric for evaluating recommendation systems, but it can be used to evaluate the distance error of the recommendation results, and it was also used in the location prediction study of Yao et al. [45]. MDE is formulated as Equation (21), where $l_{kn}$ represents any attraction in the top-N list, and $dis(x, y)$ represents the Haversine distance between $x$ and $y$. A smaller value of MDE indicates better performance in distance error.
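The four metrics can be sketched in Python as follows. This is an assumed reading of the verbal definitions above, not a transcription of Equations (18) to (21): each test case is taken to be a ranked recommendation list paired with its ground truth, results are averaged over test cases, a ground truth missing from the list contributes zero to MRR, and the Earth radius is an assumed constant.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def precision_recall_at_n(ranked, truths, n):
    """Precision@N and Recall@N for one user given a set of ground truths."""
    hits = len(set(ranked[:n]) & set(truths))
    return hits / n, hits / len(truths)


def mean_reciprocal_rank(cases):
    """Average of 1 / rank of the ground truth over (ranked, truth) cases."""
    total = 0.0
    for ranked, truth in cases:
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)  # ranks start at 1
    return total / len(cases)


def mean_distance_error(cases, coords, n):
    """Average minimum distance from the ground truth to the top-N list."""
    total = 0.0
    for ranked, truth in cases:
        total += min(haversine_km(coords[truth], coords[l]) for l in ranked[:n])
    return total / len(cases)


# Hypothetical attraction IDs with made-up coordinates (lat, lon).
coords = {0: (35.0, 139.0), 1: (35.0, 139.01), 2: (35.1, 139.0)}
precision, recall = precision_recall_at_n([3, 1, 2], [1], 2)
```

Note that MDE is zero whenever the ground-truth attraction itself appears in the top-N list, so it rewards near misses in space that the other three metrics treat the same as any other wrong answer.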

Comparison Methods
We chose several recommendation models as baselines to compare with our model:
• User-based Collaborative Filtering (UCF): UCF is a classic memory-based recommendation model that mainly uses other users with similar preferences to make recommendations [46,47].
• BPR-MF: BPR-MF is a classic model-based CF method that does not fuse any contextual or content information.
• FPMC-LR: FPMC-LR [15] is a recommendation method that improves the Factorizing Personalized Markov Chain (FPMC) method [14] by adding a geographical distance constraint.
• VBPR: VBPR is a matrix factorization model with visual information aimed at online shopping recommendations [44].
• Geo-Teaser: Geo-Teaser is a method that integrates temporal and geographical information with the negative sampling strategy of Word2Vec and hierarchical pairwise ranking to make recommendations [19].

From Figure 7, we can conclude that the STVE model outperforms all baseline methods on the four metrics. It performs particularly well on recall and MRR, indicating that a relatively high proportion of the recommended results hit the ground-truth tourist attractions and that the ground-truth attractions rank high on average in the recommended list. Additionally, the STVE model's performance in MDE is also superior to that of the other models; the gap between FPMC-LR and STVE is particularly large, which may be related to their different strategies of negative sample selection. The high MDE value of FPMC-LR suggests that selecting negative samples within the cutoff distance may not be in line with the actual situation, as recommended attractions that are closer to the current attraction seem more likely to attract tourists.

We further analyze the results of the other baseline models. As the only memory-based CF method, UCF performs far worse than the other models on all four metrics, revealing the difficulty that memory-based CF has with data sparsity and the cold-start problem. On the other hand, even though BPR-MF, the classic model-based CF, does not fuse any contextual or content information, it still performs better than UCF, which implies that selecting model-based CF as the basic method of the STVE model can effectively improve the recommendation accuracy and mitigate data sparsity compared with memory-based CF. FPMC-LR and VBPR are improved models that add context or content to BPR-MF. Both obtain better results than BPR-MF, which shows that selecting and fusing appropriate context and content into matrix factorization can improve the recommendation accuracy.
Finally, as the model whose structure and factors are most similar to those of STVE, Geo-Teaser outperforms all other baseline models. Though such results imply the advantages of the model structure of Geo-Teaser, its performance still ranks second to STVE. This may be due to two reasons. First, when modeling the sequential information with Word2Vec, Geo-Teaser takes both the previous and the next visited attractions of the current attraction as the contextual "words", without regard to direction. STVE instead uses the conditional probability in Equation (3), because predicting the next attraction given the current attraction is more in line with the actual situation. Second, Geo-Teaser considers only spatial and temporal (sequential) factors, while STVE also includes the visual factor, which may be another important factor that influences the recommendation accuracy.

Parameter Sensitivity Analysis
In this section, we discuss how the values of the parameters affect the results. The major parameters include the numbers of dimensions $f_1$ and $f_2$, the numbers of negative samples $k_{ne}$ of the $\mathcal{L}_{ST}$ part and of the $\mathcal{L}_V$ part, and the learning rates $\eta_1$ and $\eta_2$. We mainly use Recall@2 and MDE@3 to compare the performance. For each value of the parameters, we repeat the experiment three times and take the average results. We also temporarily fix the linear weight $\alpha$ at 0.5 to reduce the influence of different weights of the two components on the result.
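The repeat-and-average protocol can be sketched as a small sweep helper. The callable `run_experiment` is hypothetical (it stands in for training and evaluating the model with one parameter value and one random seed), and reporting the sample standard deviation as the error bar is an assumption rather than something the text specifies.

```python
import statistics


def sweep(run_experiment, values, repeats=3):
    """Run each parameter value `repeats` times; return mean and spread.

    run_experiment(value, seed) is a hypothetical callable that returns one
    metric (for example Recall@2) for a given parameter value and seed.
    """
    results = {}
    for value in values:
        scores = [run_experiment(value, seed) for seed in range(repeats)]
        results[value] = (statistics.mean(scores), statistics.stdev(scores))
    return results


def fake_experiment(value, seed):
    """Deterministic stand-in used only to demonstrate the helper."""
    return value + seed


summary = sweep(fake_experiment, [10, 20])
```

Each entry of `summary` pairs a parameter value with the mean and sample standard deviation of its three runs, which is the kind of data a line chart with error bars needs.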

Impact of Dimension
We first discuss the impact of the numbers of dimensions $f_1$ and $f_2$. Figures 8 and 9 show line charts with error bars of the impact of $f_1$ and $f_2$, respectively. We vary the value of $f_1$ from 10 to 50 with a step of 10, and that of $f_2$ from 40 to 100. When $f_1$ increases from 10 to 20, the Recall@2 value increases significantly. However, the increase slows down and the value remains almost steady when $f_1$ varies from 20 to 50. Similarly, when $f_2$ reaches 70 or 80, the increase of the Recall@2 value slows down and even fluctuates slightly. The common trend is that as the number of dimensions increases, the performance improves, but the time cost also increases. The difference is that the value of $f_2$ does not influence the result as much as that of $f_1$, but its influence on the time cost is much larger than that of $f_1$, because the embedding $v_{u_i}$ with $f_2$ dimensions must be multiplied with the high-dimensional visual embeddings. Therefore, as analyzed above, we set $f_1$ to 40 and $f_2$ to 60 in this paper.

Impact of Negative Samples
The impact of the number of negative samples is discussed less often than that of the dimension. Figures 10 and 11 show the impact of $k_{ne}$ of the $\mathcal{L}_{ST}$ part and $k_{ne}$ of the $\mathcal{L}_V$ part, respectively. Increasing the number of negative samples does not necessarily improve the performance: the two metrics get slightly better as $k_{ne}$ of the $\mathcal{L}_{ST}$ part increases, while they become worse as $k_{ne}$ of the $\mathcal{L}_V$ part increases. Nevertheless, the number of negative samples does not influence the result as much as the dimension does, and the overall performance remains satisfactory. Therefore, we choose the two values intuitively from the charts and set them to 5 and 1, respectively.

Impact of Learning Rate
$\eta_1$ and $\eta_2$ are the learning rates of the $\mathcal{L}_{ST}$ and $\mathcal{L}_V$ parts, respectively. Setting different learning rates for combined models has been tried in previous studies [19]. Figure 12 shows how the combination of $\eta_1$ and $\eta_2$ values influences the Recall@2 value. In the experiment, we varied $\eta_2$ from 0.001 to 0.075, and $\eta_1$ from 0.001 to 0.0075 because we find that STVE becomes drastically worse when $\eta_1$ is larger than 0.001. This may be because too large a learning rate leads to divergence. When $\eta_1$ is equal to 0.001, the Recall@2 value is generally high. Additionally, within the range of $\eta_2$ from 0.001 to 0.01, the overall result also gets better as the value increases. After $\eta_2$ exceeds 0.01, the result remains steady. We select the values of $\eta_1$ and $\eta_2$ at which they together achieve the optimal point, namely 0.001 and 0.01, respectively.

Component-Wise Study
We further explore how each component affects the performance, as STVE is a combined model considering various factors. We separate the components and compare their performance. The compared components/models are: (1) only the spatial and temporal part $\mathcal{L}_{ST}$ (marked as "ST" in the following), i.e., $\alpha$ is set to 1 in Equation (14); (2) $\mathcal{L}_{ST}$ without the spatial distance constraint, i.e., $\alpha$ is set to 1 and the negative samples are selected randomly instead of from the negative sample set outside the cutoff distance (marked as "T"); (3) only the visual part $\mathcal{L}_V$, i.e., $\alpha$ is set to 0 (marked as "V"); (4) the complete STVE model. Table 2 compares the results of STVE and its components in terms of precision@2, recall@2, MDE@3, and MRR. No single component performs as well as the complete STVE model, which is a common result in combined models. Among the three components, ST has the smallest gap to STVE, followed by T. The V component performs worst when used alone, but combining it with the ST component improves the overall performance compared with using ST alone. Additionally, fusing content information can also play an important role in solving the cold-start problem. The result shows that our STVE model can effectively improve recommendation accuracy compared with any single component.

Results for Cold-Start User
We briefly analyze how STVE performs on the cold-start issue. We treat the users who visited fewer than six tourist attractions as cold-start users and remove the other users. The number of remaining trajectories is 659. We train STVE and the five baseline methods with these remaining trajectories and compare their performance on the four metrics. As Figure 13 shows, our STVE model still obtains the best performance among these models, and its gap to the baseline models is larger. For instance, for all users in Figure 7, the MRR differences between STVE and Geo-Teaser and between STVE and UCF are 0.0792 and 0.2528, respectively, whereas, for cold-start users, the corresponding differences are 0.1531 and 0.3837. Additionally, for all users, the performance of VBPR on the first three metrics is not much different from that of FPMC-LR and is inferior to that of Geo-Teaser, whereas for cold-start users, VBPR ranks second only to STVE. This may be because VBPR is a kind of content-based model, and content-based models are not sensitive to the cold-start issue. Similarly, STVE fuses visual content based on the principle of VBPR and therefore performs well on the cold-start issue. Another notable result is that the difference between UCF and BPR-MF for cold-start users is smaller than that for all users. Taking the MRR value as an example again, the difference between UCF and BPR-MF for all users is 0.0734, whereas that for cold-start users is 0.0381. This demonstrates that both memory-based and model-based CF are negatively affected by the cold-start issue when they do not fuse any content or contextual information.
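The cold-start subset can be selected with a simple filter, sketched below. The data layout (a dictionary from user ID to the list of visited attraction IDs) and the choice to count distinct attractions are assumptions; the text only states the threshold of fewer than six visited attractions.

```python
def cold_start_trajectories(trajectories, max_attractions=6):
    """Keep users who visited fewer than `max_attractions` attractions.

    `trajectories` maps a user ID to a list of visited attraction IDs.
    Counting distinct attractions (rather than visits) is an assumption.
    """
    return {user: visits for user, visits in trajectories.items()
            if len(set(visits)) < max_attractions}


# Hypothetical users and attraction IDs.
trajectories = {
    "u1": [0, 1, 2],
    "u2": [0, 1, 2, 3, 4, 5],
    "u3": [7, 7, 8],
}
cold = cold_start_trajectories(trajectories)
```

Here `u2` visited six distinct attractions and is therefore excluded, while `u1` and `u3` are kept as cold-start users.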

Conclusions
In this paper, we propose a hybrid tourist attraction recommendation model that fuses spatial, temporal, and visual embeddings for Flickr-geotagged photos (STVE). In the preprocessing steps, we leverage a framework to automatically filter the noisy and redundant photos and select representative images of tourist attractions to extract visual embeddings as accurately as possible. To build the STVE model, we modify Skip-gram's objective function and leverage Word2Vec's negative sampling strategy to model the spatial and temporal factors. Then we use Matrix Factorization to fuse the tourist attractions' visual embeddings and train with Visual Bayesian Personalized Ranking. We select Tokyo as the study area to evaluate our STVE model.
The comparison results show that our STVE model can relieve the low accuracy issue of content-based methods and the cold-start issue of CF-based methods. We also analyzed the sensitivity of the main parameters and explored how each component influences the recommendation results. The series of results demonstrates the superiority of STVE in providing highly accurate recommendations and gives us further motivation to pursue our research. In future work, we will continue to improve our recommendation models by adding more contextual information (such as weather and season) and user attributes (such as age and gender). Furthermore, we will try to implement our model in web-based applications or other platforms for actual use.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: http://projects.dfki.uni-kl.de/yfcc100m/.