A Convolutional Neural Network and Matrix Factorization-Based Travel Location Recommendation Method Using Community-Contributed Geotagged Photos

Abstract: Travel location recommendation methods using community-contributed geotagged photos are based on past check-ins. Therefore, these methods cannot effectively work for new travel locations, i.e., they suffer from the travel location cold start problem. In this study, we propose a convolutional neural network and matrix factorization-based travel location recommendation method to address the problem. Specifically, a weighted matrix factorization method is used to obtain the latent factor representations of travel locations. The latent factor representation for a new travel location is estimated from its photos by using a convolutional neural network. Experimental results on a Flickr dataset demonstrate that the proposed method can provide better recommendations than existing methods.


Introduction
The continued growth of photo-sharing sites (e.g., Flickr and Panoramio) has increased the volume of community-contributed geotagged photos (CCGPs) that are available on the Web. These large amounts of CCGPs include rich information (e.g., user-provided tags, time, and visual contents, as shown in Figure 1 and Table 1). This information is tremendously useful for developing travel location recommendation systems [1] that take into consideration users' travel preferences as reflected in their past check-ins.
CCGPs usually include metadata, e.g., social relations and textual and contextual attributes [2][3][4][5]. Collaborative filtering (CF) is widely used by CCGP-based travel location recommendation methods; it rests on the simple conjecture that a travel location should be recommended to a user if similar users have interacted with that travel location [6,7]. Owing to its simplicity, scalability, and flexibility, matrix factorization, a popular CF method that learns latent factors to represent interaction ratings, has become a widely used model for travel location recommendation [8,9]. Ratings in the user-travel location interaction are explicit information, which has been deeply exploited in early travel location recommendation methods. To address sparse ratings, auxiliary attributes, e.g., contextual and textual attributes [10,11], are integrated into matrix factorization. However, existing integrated methods only work on implicit/explicit rating prediction problems, and latent factor representations are not learned effectively from highly sparse content information. Therefore, these methods cannot work effectively for new travel locations, i.e., they suffer from the travel location cold start problem.
Recently, convolutional neural networks (CNNs), with their powerful representation learning abilities, have shown high performance in different domains, e.g., signal processing [15] and natural language processing [16]. A CNN effectively captures local features from different layers and transforms them into a single vector [17]. Therefore, a CNN can be used to estimate the latent factor representations of new travel locations by providing a comprehensive understanding of their photos.
To address the above-mentioned problem, i.e., that the latent factor of a new travel location cannot be obtained from past check-ins, we propose to use a CNN to estimate the latent factor of a new travel location from its photos. We seamlessly integrate the CNN into weighted matrix factorization (WMF), which is commonly used to recommend travel locations. In addition, we use similarity weights between users (and between travel locations) to exploit auxiliary attributes, which are integrated into the WMF process to recommend travel locations. To the best of our knowledge, our work is the first to study the location cold start problem of travel location recommendation using photos. The main contributions of this study are summarized as follows.
• Propose a CNNMF method that integrates CNN and WMF. If a travel location does not have any past check-ins, the method uses a CNN to estimate its latent factor representation from its photos.
• Evaluate the proposed method on a CCGP dataset that covers nine popular cities worldwide. Experimental results show that CNNMF can effectively address the travel location cold start problem and achieve competitive recommendation performance.
The remainder of the paper is organized as follows: Section 2 presents the related work. Section 3 presents the basic concepts and defines the problem. Section 4 introduces the motivation and design of the methodology. Section 5 presents the experiments. Section 6 concludes the paper and provides some future work recommendations.

Related Work
In this section, we survey recent work related to our study, which covers (1) CCGP-based travel location recommendation and (2) the use of extra data to address the travel location cold start problem.

CCGP-Based Travel Location Recommendation
CCGP-based travel location recommendation methods focus on two types of recommendation: general and personalized travel location recommendations. Methods of general travel location recommendation focus on recommending popular travel locations and sequences, which are usually extracted and clustered from past check-ins. Jiang et al. [18] proposed obtaining travel sequences from CCGPs by taking multiple attributes (e.g., time, cost, and tags) into account. Liu et al. [19] used a generative method and a convolutional network to model users' check-in sequences. Other typical methods have been used to recommend locations and sequences in a given geospatial area [20,21]. Zheng et al. [20] analyzed the relationship between tourists' patterns and the regions of attraction. Chen et al. [21] fused geotagged photos and check-ins for route recommendation.
By contrast, methods of personalized travel location recommendation focus on recommending travel locations that suit users' preferences. Majid et al. [1,22] proposed recommending travel locations by using users' preferences in past check-ins under different contexts (e.g., time and weather). In recent years, user attributes (e.g., age and gender) have been extracted from photo contents to construct user profiles [23,24]. Cheng et al. [23] extracted user attributes from photo contents; this work was extended to include travel group type (e.g., couples, friends, and families) [24] and later to integrate the above-mentioned attributes with matrix factorization for travel location recommendation [9]. These methods offer an alternative way to relieve the limitations of metadata by obtaining users' travel preferences from the contents of CCGPs (e.g., users' age and gender information), but they cannot effectively mitigate the travel location cold start problem.

Using Extra Data to Address the Travel Location Cold Start Problem
Researchers have also focused on relieving the travel location cold start problem. Gao et al. [12,13] incorporated social network information with geo-social correlation to capture geographical distance and social network from location-based social networks to mitigate the problem. Sun et al. [3] proposed the integration of context information with a support vector machine to solve the regression problem. The main purpose is to extract more attributes to address the problem. However, few studies have investigated the travel location cold start problem. Despite their effectiveness for various data mining tasks, photo contents have not been studied for mitigating the problem.
Many studies have been conducted on using photo contents to extract travel location information, based on color, texture, or shape representation [25][26][27]. Ke et al. [25] proposed a method that transforms the photo annotation problem into a multi-label learning problem. Kuang et al. [26] estimated visual information from photos by associated tags in a local region. Xing et al. [27] extracted features from photos to obtain users' preferences and travel location properties. Weyand et al. [28] determined where a photo was taken only from its pixels by dividing the surface of the earth into thousands of multi-scale geographical cells. Wang et al. [29] proposed a method to extract visual contents from photos to help learn latent factor representations; however, its probabilistic learning method fails to obtain accurate travel locations from hidden features. The reason is that each travel location has limited photos, which are insufficient to describe the travel location.
Our method is different from the above-mentioned methods. We use a CNN to estimate the latent factor of a new travel location from its photos. Latent factors obtained by applying WMF to the past check-ins are used as ground truths to train the CNN.
Our research problem can be formulated as follows. If a travel location does not have past check-ins, then CNNMF uses a CNN to estimate its latent factor representation from its photos, and seamlessly integrates CNN into WMF, from which we aim to recommend new travel locations to users.

Methodology
The framework of CNNMF is illustrated in Figure 2. The travel locations are determined using the spatial proximity of CCGPs. Then, from the visited travel locations, we build the user-travel location interaction matrix. Heterogeneous metadata are mined to exploit contextual attributes (i.e., time, weather, and season), textual attributes (i.e., tags), and geographical attributes (i.e., distance), which are incorporated into the WMF process to recommend travel locations. We also use a CNN to process the content of photos for estimating latent factors in travel location cold start cases.

Discovering Travel Locations from CCGPs
Discovering travel locations from CCGPs can be considered a clustering problem. Clustering algorithms, such as mean-shift, have been applied to discover travel locations from CCGPs [30]. The DBSCAN [31] algorithm has the following advantages compared with other clustering algorithms: (1) it requires minimum domain knowledge to determine parameters and identifies clusters with spot style; (2) it can work efficiently with large-scale data. However, the DBSCAN algorithm is unsuitable for extracting travel locations from CCGPs because travel locations differ in size and density. To address this problem, Kisilevich et al. [32] presented a clustering algorithm based on DBSCAN called P-DBSCAN, which is suited to detecting places and events from collections of CCGPs and redefines direct density reachability by using an adaptive density.
In our study, the P-DBSCAN clustering algorithm is used to find travel locations from CCGPs. We obtain a set of travel locations $L = \{l_1, l_2, \ldots, l_n\}$. Each location element is defined as $l = \{P_l, cord_l\}$, where $P_l$ is the collection of clustered photos and $cord_l$ is the geographical coordinates of the centroid of those photos.

Contextual Information Modeling
Time-stamp information allows us to recover the weather context $w$, the time-of-day context $t$, and the season context $s$. Weather web services (WWSs) normally provide weather status information on an hourly, daily, or monthly basis. By querying a WWS with a time-stamp, we can discover the context $w$ (including temperature and sky condition) at the time visit $\beta = (u, l, t)$ is made. We use the API of wunderground.com to obtain weather information. To obtain context $t$, we use the mean taken time of the photos belonging to a visit. Context $s$ is then derived. The detailed definitions of the time-of-day, season, and weather contexts are as follows:
• Time of day: weekday AM, weekday PM, weekend AM, and weekend PM.
• Weather-sky condition: sunny, cloudy, rainy, snowy, and foggy.
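As an illustration of the time-of-day context, the following minimal sketch maps the photos of one visit to one of the four categories listed above by using their mean taken time. The function names, the noon boundary between AM and PM, and the Saturday/Sunday definition of a weekend are assumptions made for this example, not details given in the text.

```python
from datetime import datetime, timedelta


def mean_taken_time(taken_times):
    """Average a list of datetimes via their offsets from the earliest one."""
    start = min(taken_times)
    total = sum((t - start for t in taken_times), timedelta())
    return start + total / len(taken_times)


def time_of_day_context(taken_times):
    """Return 'weekday AM', 'weekday PM', 'weekend AM', or 'weekend PM'."""
    mean_time = mean_taken_time(taken_times)
    # Assumption: Saturday (5) and Sunday (6) count as the weekend.
    day_type = "weekend" if mean_time.weekday() >= 5 else "weekday"
    # Assumption: AM means before 12:00 local time.
    half = "AM" if mean_time.hour < 12 else "PM"
    return f"{day_type} {half}"


# Hypothetical visit with two photos taken on a Saturday morning.
visit_photos = [
    datetime(2019, 7, 6, 9, 15),
    datetime(2019, 7, 6, 10, 45),
]
print(time_of_day_context(visit_photos))
```

Averaging offsets from the earliest photo avoids arithmetic on absolute timestamps and keeps the result a `datetime`, so the weekday and hour can be read off directly.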

Textual Information Modeling
The metadata of CCGPs contain rich heterogeneous information (e.g., textual information). Tags are classified under textual information, which is necessary for modeling users and travel locations [7,33]. A topic model, e.g., latent Dirichlet allocation (LDA) [34], assumes that each document in a collection (corpus) can be described as a mixture of topics, where each topic is defined by a collection of "typical" or "likely" words. The graphical model representation of the LDA model is presented in Figure 3, which works as follows:
• Select parameters $\theta_i \sim \mathrm{Dir}(\alpha)$, where $\theta_i$ is the topic distribution of document $i$ and $\mathrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$.
• For each word:
I. Select a topic $z \sim \mathrm{multinomial}(\theta_i)$.
II. Select a word $w \sim \mathrm{multinomial}(\beta_z)$.
Our method uses the topic model to obtain the latent topic distributions of users and travel locations, thereby exploiting the textual information of CCGPs. The tag set of all the photos of a travel location $l$, as well as that of a user $u$, is regarded as a document, and we use the topic model to obtain the topic distributions $txt_l$ and $txt_u$.
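The generative steps of LDA listed above can be sketched in a few lines of standard-library Python. This is only an illustration of how a tag document is assumed to be generated, not the inference procedure used to fit the model; the toy vocabulary, the topic-word matrix `beta`, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocabulary, n_words, rng):
    """Generate one tag document by following the LDA generative steps."""
    theta = sample_dirichlet(alpha, rng)  # topic distribution of this document
    topics = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]        # I. select a topic
        w = rng.choices(vocabulary, weights=beta[z])[0]  # II. select a word
        words.append(w)
    return theta, words


# Hypothetical example: two topics over a three-tag vocabulary.
vocabulary = ["museum", "park", "beach"]
beta = [[0.9, 0.1, 0.0],   # topic 0 favors "museum"
        [0.0, 0.1, 0.9]]   # topic 1 favors "beach"
theta, tags = generate_document([0.5, 0.5], beta, vocabulary, 5, random.Random(0))
print(theta, tags)
```

In practice the topic distributions of users and travel locations would be inferred from their observed tag documents with an LDA implementation rather than sampled forward as done here.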

Obtaining Explicit Information
After explicit attributes are obtained, user and travel location features are constructed as $u = (w_u, s_u, t_u, txt_u)$ and $l = (w_l, s_l, t_l, txt_l, dis_l)$. The combination of the user and travel location features is applied in matrix factorization. $M_{ll}$ represents the similarity matrix between travel locations, and $M_{uu}$ represents the similarity matrix between users. Both $M_{ll}$ and $M_{uu}$ are utilized to help the factorization of the user-travel location matrix. A similarity value is between 0 and 1, and a large value indicates high similarity. $dis(l_j, l_k)$ is the geographical distance between two travel locations. Travel location-travel location similarity can be calculated by Equation (1):
where $x_{jq}$ and $x_{kq}$ represent the $q$-th feature of travel locations $l_j$ and $l_k$, respectively, and $y$ is the number of features. User-user similarity can be calculated by Equation (2):
where $x_{ig}$ and $x_{kg}$ represent the $g$-th feature of users $u_i$ and $u_k$, respectively, and $y$ is the number of features. Weighted travel location and user similarities can be calculated by Equations (3) and (4), respectively.
$\mathrm{Sim}_l(l_j, l_k) = a \times \mathrm{sim}_l(l_j, l_k) + b \times 1 \ldots$
where $a$ and $b$ represent the similarity weights between travel locations and $c$ represents the similarity weight between users, which are used to help the factorization of the user-travel location interaction. The weights are set as follows: $a = b = \frac{1}{5}$ and $c = \frac{1}{4}$.
ISPRS Int. J. Geo-Inf. 2020, 9, 464

Factorizing User-Travel Location Interaction
The user-travel location interaction plays an important role in travel location recommendation. Let $r_{ij}$ be the number of times that user $i$ has visited travel location $j$, which can be obtained from the past check-ins. To calculate the weighted effect of user and travel location, we use the WMF algorithm [35]. Let $P_{ij}$ be the preference of user $i$ for travel location $j$, which is obtained by binarizing $r_{ij}$, as shown in Equation (5). Let $C_{ij}$ be the confidence of $P_{ij}$, which is obtained by Equation (6).
where the constants in Equation (6), including $\gamma$, are hyperparameters. Suppose that $u_i \in \mathbb{R}^{N \times k}$ denotes the users' latent preferences and $l_j \in \mathbb{R}^{M \times k}$ denotes the travel locations' latent properties. The basic travel location recommendation method approximates $u_i$'s latent preference for an unvisited $l_j$ by solving the following optimization problem.
where $C_{ij} \in \mathbb{R}^{N \times M}$ is the check-in weighting matrix with $C_{ij} = 1$ indicating that $u_i$ has checked in at $l_j$, and $C_{ij} = 0$ otherwise. Following a previous work [9], heterogeneous similarity information, i.e., user-user similarity and travel location-travel location similarity, can be used to constrain WMF for travel location recommendation, as presented in Equation (8):
where $u_i$ is the latent factor vector of user $i$, $l_j$ is the latent factor vector of travel location $j$, the two regularization terms $\|u_i\|_F^2$ and $\|l_j\|_F^2$ are used to avoid overfitting, and $G(i)$ and $Q(j)$ are the user and travel location similarities of user $i$ and travel location $j$, respectively. $\lambda_1$ and $\lambda_2$ are nonnegative parameters used to control the regularization terms and the similarity regularization terms.
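To make the WMF ingredients concrete, the sketch below computes the binarized preference, a confidence weight, and a confidence-weighted squared loss with L2 regularization. The log-scaled confidence $1 + \gamma \log(1 + r_{ij}/\epsilon)$ is one common choice in the WMF literature and is an assumption here; the exact form of Equation (6) may differ. The similarity regularization terms of Equation (8) are deliberately left out, and all function names and default values are hypothetical.

```python
import math


def preference(r):
    """Binarize a visit count: 1 if the user has visited the location, else 0."""
    return 1.0 if r > 0 else 0.0


def confidence(r, gamma=1.0, eps=1.0):
    """Assumed log-scaled confidence; the source's Equation (6) may differ."""
    return 1.0 + gamma * math.log(1.0 + r / eps)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def wmf_loss(R, U, L, lam=0.1, gamma=1.0, eps=1.0):
    """Confidence-weighted squared error plus L2 regularization on all factors."""
    loss = 0.0
    for i, u in enumerate(U):
        for j, l in enumerate(L):
            err = preference(R[i][j]) - dot(u, l)
            loss += confidence(R[i][j], gamma, eps) * err ** 2
    loss += lam * (sum(dot(u, u) for u in U) + sum(dot(l, l) for l in L))
    return loss


# Hypothetical toy data: one user, two locations, two latent dimensions.
R = [[2, 0]]                  # visit counts r_ij
U = [[1.0, 0.0]]              # user latent factors
L = [[1.0, 0.0], [0.0, 1.0]]  # location latent factors
print(wmf_loss(R, U, L))
```

In this toy case the predictions match the preferences exactly, so only the regularization term contributes to the loss.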

Exploiting Visual Content
With powerful representation learning abilities, CNN is widely used to improve the state-of-the-art, e.g., signal processing [15] and natural language processing [16]. CNN can effectively catch local features from different layers and transform features to a single vector [17]. We select the state-of-the-art CNN architecture VGG-16 [36], which consists of 16 layers, including 13 convolution, 3 fully connected (FC), 5 max-pooling, and 1 softmax layer, as shown in Figure 4. The size of the input photo is 224 × 224 × 3, where 3 is the number of channels (i.e., RGB), and each CCGP is resized to 224 × 224.
Recent transfer learning studies have demonstrated that a CNN trained on one large dataset can be generalized to extract CNN features for other datasets, and can outperform state-of-the-art approaches on these new datasets for different tasks [37,38]. Therefore, we initialize the weights of VGG-16 with weights pre-trained on the place database. Let $f_l$ index the $l$-th convolutional layer, $v_l$ be the number of filters in the $l$-th convolution layer, $z_l$ be the spatial size of the filter, and $m_l$ be the spatial size of the output feature map. The updating of $W$ is dominated by the computation of the convolution layers, and the time complexity for one input is $O(f_l \cdot z_l^2 \cdot v_l \cdot m_l^2)$ [39]. We remove the last FC and softmax layers, which are used for classification purposes, and take the output of the second FC layer (i.e., FC7) as the representation of a CCGP, i.e., $r_j$.
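To give a feel for the size of the convolution cost, the following sketch counts multiply-accumulate operations for the 13 convolution layers of VGG-16 on one 224 × 224 × 3 input. It assumes the per-layer cost is (input channels) × (filter size)² × (number of filters) × (output map size)², with 3 × 3 filters and the standard VGG-16 channel and feature-map sizes; the function and variable names are hypothetical.

```python
def conv_cost(in_channels, filter_size, n_filters, out_size):
    """Multiply-accumulate count of one convolution layer for a single input."""
    return in_channels * filter_size ** 2 * n_filters * out_size ** 2


# (number of filters, output feature-map size) for the 13 conv layers of VGG-16.
VGG16_CONV = [
    (64, 224), (64, 224),
    (128, 112), (128, 112),
    (256, 56), (256, 56), (256, 56),
    (512, 28), (512, 28), (512, 28),
    (512, 14), (512, 14), (512, 14),
]


def vgg16_conv_cost(input_channels=3, filter_size=3):
    """Sum the per-layer costs, feeding each layer's filters to the next layer."""
    total = 0
    in_channels = input_channels
    for n_filters, out_size in VGG16_CONV:
        total += conv_cost(in_channels, filter_size, n_filters, out_size)
        in_channels = n_filters
    return total


print(vgg16_conv_cost())  # about 1.5e10 multiply-accumulates per photo
```

The count covers the forward pass of the convolution layers only; the FC layers and the backward pass are not included.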

Estimating Latent Factor from Photos
Estimating latent factors for a given travel location from the corresponding CCGPs is a regression problem. Since latent factors are real-valued, the core objective is to minimize the mean square error of the estimations. Let $l_j$ be the latent factor vector of location $j$, which is obtained by WMF, and let $\hat{l}_j$ be the corresponding prediction by the CNN. Then, the minimization problem is presented as follows:
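A mean-square-error regression objective of this kind can be sketched as follows. This is one plausible form rather than the paper's exact equation: $W$ denotes the CNN weights, $\hat{l}_j$ the CNN's prediction for location $j$, $P_{l_j}$ the photos of that location, and the normalization by $|L|$ is an assumption.

```latex
\min_{W} \; \frac{1}{|L|} \sum_{l_j \in L}
  \left\| l_j - \hat{l}_j \right\|_2^2,
\qquad
\hat{l}_j = \mathrm{CNN}\!\left(P_{l_j};\, W\right)
```

Here the WMF latent factors $l_j$ act as fixed regression targets, so only $W$ is optimized in this step.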

Travel Location Recommendation
The framework of CNNMF is shown in Figure 2. By fusing Equations (8) and (9), the objective function of CNNMF can be written as follows:
where $\lambda_3$ and $\lambda_4$ are parameters used to control the estimation of latent factors and the regularization terms. Equations (11) to (13), which are based on gradient descent, are used to update user $u_i$ and travel location $l_j$, respectively.

The Learning Algorithm of CNNMF
With the above-mentioned update rules, the algorithm of CNNMF is summarized in Algorithm 1. The proposed CNNMF framework uses similarity weights between users (and between travel locations) to exploit auxiliary attributes, which are integrated into the WMF process to recommend travel locations. For a new travel location, we initialize the weights of VGG-16 with the weights pre-trained on the place database for photo classification. The place database is a very large photo dataset containing 7,076,580 photos from 476 scene categories. In practice, we keep the earlier layers fixed. This is motivated by the observation that the earlier layers of a CNN contain more generic features that should be useful to many tasks, whereas the later layers become progressively more specific to the details of the original dataset and therefore need to be adapted to be useful for travel location recommendation. In summary, all user and travel location latent factors are updated in $O(k^2 n_P + k^3 N + k^3 M)$, where $n_P$ is the number of observed ratings. Note that photo latent vectors are computed while updating $W$. The time complexity for updating $W$ is dominated by the computation of the convolution layers, and thus all weight and bias variables of the CNN are updated in $O\left(\sum_{l=1}^{d} v_{l-1} \cdot z_l^2 \cdot v_l \cdot m_l^2\right)$ [39], where $d$ is the number of convolution layers. The total time complexity per epoch is $O\left(k^2 n_P + k^3 N + k^3 M + \sum_{l=1}^{d} v_{l-1} \cdot z_l^2 \cdot v_l \cdot m_l^2\right)$, and this optimization process scales linearly with the size of the given data. Finally, we compute the score $u_i^T l_j$ and recommend the travel locations with the highest scores.
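The final scoring step can be illustrated with a short sketch that ranks travel locations by the inner product of the user and location latent vectors. The function name, the dictionary layout, and the toy vectors are hypothetical; for a cold start location, the vector would be the CNN estimate rather than a WMF factor.

```python
def recommend(user_vector, location_vectors, n=10):
    """Rank locations by the inner-product score and return the top-n ids."""
    scores = {
        loc_id: sum(u * l for u, l in zip(user_vector, vec))
        for loc_id, vec in location_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical latent vectors with two dimensions.
locations = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.5, 0.5],
}
print(recommend([0.2, 0.8], locations, n=2))
```

Whether already visited locations should be filtered out before ranking is not stated in the text, so this sketch ranks all candidates.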

Algorithm 1. The proposed Framework CNNMF
Input: $P$: user-travel location preference matrix
$\mathrm{Sim}_u(u_i, u_k)$: user-user similarities
$\mathrm{Sim}_l(l_j, l_k)$: travel location-travel location similarities
$P_{l_j}$ for $l_j \in L$: photos of each travel location
Output: latent factor vectors of users and travel locations, $u_i$, $l_j$
1: initialize the weights of VGG-16 on the place database
2: initialize $u_i$, $l_j$, and $\hat{l}_j$
3: for each $u_i$ do
4: update by Equation (10)

Experiments
In this section, we set up experiments to evaluate the performance of the proposed method. We begin by introducing the dataset, the parameter settings, the impact of the topic number, and the impact of diverse types of information, followed by a comparison of the proposed method with state-of-the-art travel location recommendation methods.

Dataset
We employ the CCGP dataset D used by Jiang et al. [18], which contains uploads from 7387 users. The dataset consists of photo albums associated with past check-ins, which are taken in nine popular tour cities (i.e., New York, Los Angeles, Chicago, Barcelona, Berlin, London, Paris, Rome, and San Francisco). We removed photos without latitude and longitude, as well as "selfie photos", as these photos cannot give enough information about travel locations. The final statistics of the dataset are shown in Table 2, and the spatial distribution of photos in popular tour cities is shown in Figure 5.

Parameter Settings
In this section, we provide the setting of several parameters utilized in our experiments.

• To obtain the user-travel location interaction information, we empirically set a threshold of visit duration $visit_{thr} = 6$ h.
• CNN is employed to estimate the latent factors of new travel locations. The learning rate is 0.001 for 60 epochs, and the mini-batch size is 128. The momentum is 0.9. The weight decay is 0.0005. The weights are randomly initialized following previous work [40].
In the following experiments, we split the dataset $D$ by visiting time into the training set $D_{train}$ (80%) and the test set $D_{test}$ (20%). Two evaluation metrics, MAP@n and AP@n, are adopted to evaluate recommendation effectiveness; they are calculated by Equations (13) and (14), respectively.
where $n$ indicates the number of recommended travel locations, and $m$ represents the number of users. The relevance value $lik_j = 1$ if the user has visited the travel location; otherwise, $lik_j = 0$.
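The two metrics can be sketched as follows. AP@n has several variants in the literature; this version averages precision@j over the ranks j ≤ n at which a visited location appears and normalizes by the number of such hits, which is an assumption and may differ from the paper's exact equations. The function names and toy data are hypothetical.

```python
def average_precision_at_n(recommended, visited, n):
    """AP@n: mean of precision@j over the top-n ranks where the relevance is 1."""
    hits = 0
    precision_sum = 0.0
    for j, loc in enumerate(recommended[:n], start=1):
        if loc in visited:  # relevance value is 1 at rank j
            hits += 1
            precision_sum += hits / j
    return precision_sum / hits if hits else 0.0


def map_at_n(recommendations, ground_truth, n):
    """MAP@n: AP@n averaged over all m users."""
    users = list(recommendations)
    total = sum(
        average_precision_at_n(recommendations[u], ground_truth[u], n)
        for u in users
    )
    return total / len(users)


# Hypothetical example with two users.
recs = {"u1": ["a", "b", "c"], "u2": ["b", "c", "a"]}
truth = {"u1": {"a", "c"}, "u2": {"d"}}
print(average_precision_at_n(recs["u1"], truth["u1"], 3))
print(map_at_n(recs, truth, 3))
```

For user `u1`, the hits at ranks 1 and 3 give precisions of 1 and 2/3; user `u2` has no hit in the top 3 and contributes 0 to the mean.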

The Impact of Topic Number
The number of topics $k$ is a significant parameter and has an impact on recommendation performance. To decide on an optimal number of topics, we conduct an experiment to study its impact. The result is shown in Figure 6, from which we can find that the MAP reaches 30.7% when $k$ is 9. Therefore, $k$ was set to 9 in the following experiments.

The Impact of Diverse Types of Information
To explore the impact of diverse types of information on travel location recommendation, we set $\lambda_3 = \lambda_4 = 0$, which reduces the CNNMF framework to Equation (7), and then eliminate "time", "weather", "season", "tags", and "geographical distance" information in turn. The results are given in Table 3, from which the following tendencies can be observed:
• Diverse types of information enhance recommendation performance to diverse degrees. According to influence degree, the information can be ranked as follows: season information > weather information > text information > time information > geographical distance information. The performance after eliminating "season" information is the lowest, which means that "season" information is the most important information for recommending travel locations. The performance after eliminating "geographical distance" information is the highest, which means that "geographical distance" information is the least important information, as most travel locations are not far from each other.
• The MAP of the proposed method is significantly better than those of the five other variants, which demonstrates that the proposed method integrates contextual, textual, and geographical information together and can thus provide improved recommendations.

The Performance Comparison of Recommendation Methods
To investigate the capability of the proposed method to recommend travel locations, we compared it with the following representative methods.

• Dynamic topic model and matrix factorization (DTMMF): DTMMF integrates a topic model with matrix factorization to recommend travel locations. DTM is used to obtain implicit information, while explicit information is obtained from past check-ins and visual contents (i.e., age and gender) to construct user and travel location profiles [9].
• Neural network-based Collaborative Filtering (NCF): NCF combines matrix factorization with a multi-layer perceptron to capture nonlinear user-travel location interactions [41]. Visual content is not considered.
• Visual-enhanced probabilistic matrix factorization model (VPMF): VPMF uses visual features to learn user preferences by leveraging the past check-ins of users. Then, it integrates user preferences with travel location constraints for trip planning [42].
• Visual Bayesian personalized ranking (VBPR): VBPR extracts the visual features from photos using a pre-trained method without any context information [5]. The extracted visual features are used to predict the scores of people's opinions.
• Visual Content Enhanced POI recommendation (VPOI): VPOI uses joint learning of photo classification, matrix decomposition, and visual feature extraction tasks [29] to recommend travel locations to the user. The difference from the proposed method is that VPOI uses photos for joint learning of the latent factor vector representations.
For fairness, all representative methods use the same total number of dimensions. The results are given in Table 4, from which the following observations can be made:
• VPOI works better than VBPR, which might be because VPOI models photos for both users and travel locations, while VBPR only models photos for travel locations.
• The proposed method significantly outperforms VPOI. This is because it incorporates contextual (i.e., time, weather, and season), textual (i.e., tags), and geographical (i.e., distance) information, while VPOI only uses photos for joint learning of the latent factor vector representations.

We also compare our framework with the same representative methods on the travel location cold start problem. We randomly select 5% of the travel locations' photos from the training set and remove their check-ins. Moreover, we remove the photo albums from the remaining 20% and use them as the testing set. All photo albums are associated with check-ins in our Flickr dataset. These travel locations (5%) still have photos but no check-ins, and the photos can help mitigate the travel location cold start problem. The results are given in Table 5, from which the following observations can be made:

• In general, the performance of all methods drops under the travel location cold start setting. For example, the performance of DTMMF decreases by up to 14.65% in terms of MAP@10.
• The performance reduction of VBPR is much smaller than that of DTMMF, as VBPR learns an additional layer to exploit the visual dimensions, which helps to alleviate the travel location cold start problem, while DTMMF uses visual contents only to extract attributes (i.e., age and gender) based on face recognition.
• CNNMF significantly outperforms VPOI, although both methods use visual contents. The difference is that CNNMF directly obtains the latent factor of a travel location from its photos, which serve as descriptions of the travel location, while VPOI uses photos to help learn the latent factor vector representation.
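
The cold start idea discussed above (hold out a share of travel locations without check-ins, estimate their latent factors from photos, then score them for users) can be illustrated with a small sketch. This is not the authors' implementation: the data are synthetic, a ridge regression on precomputed photo feature vectors is swapped in for the CNN, and all names (`photo_feat`, `fit_ridge`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_locs, n_factors, n_feat = 50, 200, 8, 16

# Synthetic stand-ins (assumptions, not the Flickr dataset):
# one pooled photo feature vector per travel location, and latent
# factors that we pretend were learned by WMF from check-ins.
photo_feat = rng.normal(size=(n_locs, n_feat))
true_map = rng.normal(size=(n_feat, n_factors))
loc_factors = photo_feat @ true_map
user_factors = rng.normal(size=(n_users, n_factors))

# 1) Treat 5% of the travel locations as cold: their check-ins
#    (and hence their WMF factors) are considered unavailable.
n_cold = int(round(0.05 * n_locs))
perm = rng.permutation(n_locs)
cold_ids, warm_ids = perm[:n_cold], perm[n_cold:]


def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float = 1e-3) -> np.ndarray:
    """Closed-form ridge regression: solve (X^T X + alpha I) W = X^T Y."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)


# 2) Learn a map from photo features to latent factors on warm
#    locations only (ridge regression standing in for the CNN).
W = fit_ridge(photo_feat[warm_ids], loc_factors[warm_ids])

# 3) Estimate latent factors of cold locations from their photos.
est_cold = photo_feat[cold_ids] @ W

# 4) Score cold locations for every user with the usual dot product
#    and pick the three best cold locations for user 0.
scores = user_factors @ est_cold.T  # shape: (n_users, n_cold)
top_cold_for_user0 = cold_ids[np.argsort(-scores[0])[:3]]
```

Because the synthetic factors are an exact linear function of the photo features, the estimates in this toy setting are nearly perfect; real photo-to-factor mappings are nonlinear, which is the motivation for using a CNN instead of a linear model.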

Conclusions and Future Work
In this study, we propose CNNMF, which integrates CNN and WMF to obtain the latent factor representations of new travel locations. We use similarity weights between users (and between travel locations) to exploit auxiliary attributes, which are integrated into the WMF process to recommend travel locations.
If a travel location does not have past check-ins, the proposed method uses a CNN to estimate its latent factor representation from its photos. Experimental results demonstrate that CNNMF significantly outperforms existing methods. Future research can extend this work in the following directions: (i) extracting more information from photos (e.g., social correlations), which can help to mitigate the user cold start problem; (ii) incorporating other competitive recommendation methods.
Author Contributions: Conceptualization and methodology, Thaair Ameen and Ling Chen; validation and formal analysis, Thaair Ameen, Ling Chen, Zhenxing Xu, Dandan Lyu, and Hongyu Shi; software, Thaair Ameen and Zhenxing Xu; writing-original draft preparation, Thaair Ameen and Ling Chen; writing-review and editing, Thaair Ameen and Ling Chen. All authors have read and agreed to the published version of the manuscript.