Transfer Learning of a Deep Learning Model for Exploring Tourists’ Urban Image Using Geotagged Photos

Recently, as computer vision and image processing technologies have rapidly advanced in the artificial intelligence (AI) field, deep learning technologies have been applied to urban and regional studies through transfer learning. In the tourism field, studies are emerging that analyze tourists' urban image by identifying the visual content of photos. However, previous studies are limited in properly reflecting the unique landscape, cultural characteristics, and traditional elements of a region that are prominent in tourism. To go beyond these limitations, we crawled 168,216 Flickr photos, created a tourist photo classification of 75 scenes and 13 categories by analyzing the characteristics of photos posted by tourists, and developed a deep learning model by continuously re-training the Inception-v3 model. The final model shows a high accuracy of 85.77% for the Top 1 and 95.69% for the Top 5. The final model was applied to the entire dataset to analyze the regions of attraction and the tourists' urban image in Seoul. We found that tourists are attracted to Seoul, where modern features such as skyscrapers and uniquely designed architecture are mixed with traditional features such as palaces and cultural elements. This work demonstrates a tourist photo classification suited to local characteristics and the process of re-training a deep learning model to effectively classify a large volume of tourists' photos.


Introduction
Today, people share ideas, photos, videos, and posts with others; maintain their social relationships; and find news and information through social network services (SNS). As the number of users connected to SNS platforms has increased exponentially, SNS is being utilized as a major source of data in various fields. In particular, user-generated content on SNS is recognized as a major source of data for grasping the urban image perceived by tourists [1][2][3].
Among SNS data, Flickr, which is designed for sharing photos among users, has been used in various studies because it includes location and time information in the photo metadata and is open to the public. Using Flickr data, studies such as analysis of regions of attraction [4,5], analysis of city image and emotion [6][7][8], and analysis of location-based recommendation systems [9][10][11] have been conducted. However, these studies were limited in analyzing the visual content of photos because suitable methodologies and techniques were lacking.
As a photo is regarded as reflecting the photographer's inner feelings, Pan et al. analyzed 145 tourist photos posted in The New York Times and revealed that the landscape contained in a photo is linked to the urban image perceived by tourists [12]. Donaire et al. recognized that photos play an important role in the formation of tourism images [13]. They classified tourists into four groups and identified the favorite regions of attraction of each group through the analysis of 1786 photos downloaded from Flickr. This conventional way of analyzing photos identifies the visual content manually and uses the text attached to the photos as an auxiliary means. The conventional way has the advantage of providing a conceptualized theoretical framework, but it has the disadvantages that the number of photos is limited and that manual category classification is unstable and irregular [14].
ISPRS Int. J. Geo-Inf. 2021, 10, 137
Recently, as computer vision and image processing technologies have rapidly advanced in the AI field, techniques for analyzing the visual content of photos are also evolving in the field of urban studies. Visual content analysis of photos using AI technology has the advantage of quickly classifying a large volume of photos according to a standardized classification. As the convolutional neural network (CNN), a type of artificial neural network, shows high performance in image identification and classification, it is widely applied in research that analyzes the visual content of photos. Representative CNN architectures include AlexNet [15], GoogLeNet [16], and ResNet [17]. In particular, in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), AlexNet showed more than 10% better performance than the existing image recognition models. Since then, deep CNN models such as VGGNet [18], DenseNet [19], and MobileNet [20] have evolved rapidly.
As CNN models show excellent performance in image recognition, the application of these models to other areas through transfer learning continues to surge. Transfer learning is the fine-tuning of CNNs pre-trained on a large annotated image dataset for other domains or tasks [21]. In the field of urban studies, especially in the tourism area, studies that classify tourist photos based on CNN models have begun to appear [14,22,23,24]. However, these studies are limited in reflecting the unique landscape or regional characteristics of an area.
With the purpose of overcoming these limitations of previous studies, this study aimed to apply computer vision and image processing technique to effectively classify a large volume of Flickr photos uploaded by tourists. This study had three objectives: (1) develop a tourism photo classification by analyzing the characteristics of photos; (2) propose detailed procedures of training a deep learning model to enhance the model accuracy; and (3) analyze the urban images of tourists visiting Korea by applying the final model to the entire dataset.

Literature Review
In the field of computer vision, image analysis began with image classification, which assigns a single label to an image. Recently, image analysis has developed into object detection, which extracts a specific object from an image [25,26]; image captioning, which generates a textual description of an image [27]; and multi-label classification, which assigns multiple labels to a single image [28].
A labeled dataset is required to train a model in deep learning-based image classification. ImageNet, a representative database used for deep learning, contains 14,197,122 images labeled with 1000 categories. ImageNet assigns a single label to an object. The performance of deep learning models such as AlexNet, VGGNet, and ResNet is evaluated based on the ImageNet dataset. In addition to the ImageNet dataset, SUN [29] and Places365 [30] are datasets that systematically classify scenes. The SUN dataset includes 108,754 images, with 397 scene semantic categories. The Places365 dataset includes 10M images, with 434 scene semantic categories. Recently, as the need to assign multiple tags to a scene has been recognized, Tencent's multi-label image dataset [28] was released. A scene classification dataset acquired from remote sensing [31] and the Place Pulse dataset [32], which evaluates the emotions evoked by the urban built environment through street images, were also released. In addition to these labeled data, street-level images, such as Google Street View (GSV) and Tencent Street View (TSV), and geotagged photos from online photo sharing services, such as Flickr and Panoramio, have become major sources of data for urban studies.
Studies where the CNN model is applied in urban and tourism areas can be divided into two approaches: using a pre-trained model as is and using a re-trained model through transfer learning. Studies that apply the pre-trained model in urban area identify optimal location or evaluate street environments through density analysis after detecting specific objects using an object detection model [33,34]; identify crime scenes or analyze the visual appearance of cities using image segmentation model [35,36]; or cluster or regroup classification results after applying a pre-trained image classification model [37,38].
In addition, several studies have analyzed the tourists' urban image by applying a pre-trained model in the tourism area. Chen et al. classified Flickr photos using the ResNet model trained on the Places365 dataset and analyzed regions of interest and seasonal dynamics to identify the difference between urban and non-urban areas of London [39]. Payntar et al. analyzed which photos were mainly taken in the World Heritage site of Cuzco, Peru, using the ResNet50 model trained on the Places365 dataset [23]. Kim et al. analyzed Seoul tourism images by classifying Flickr photos into 1000 categories using the Inception-v3 model trained on the ImageNet dataset [24]. These studies, however, are limited in reflecting local characteristics when the pre-trained models are applied to specific regions. Chen et al. pointed out that the ResNet model pre-trained on the Places365 dataset could misclassify Flickr data [39]. Payntar et al. also reported that the ResNet model pre-trained on the Places365 dataset did not reflect regional characteristics when classifying scenes in cultural heritage regions [23]. In particular, Kim et al. argued that it is necessary to create a photo classification and re-train the model because, when using the pre-trained model, Korean detached houses were misclassified as prisons, and Korean traditional buildings and unusual landscapes were also misclassified. They pointed out that the overall accuracy was only 27.93% when the predicted labels of 38,891 classified photos were checked as "true" or "false" [24].
Studies that apply a re-trained model through transfer learning in the urban area build a model that predicts human perception of a city, such as scenicness, safety, and quality [40][41][42]; construct a fusion model that predicts the relative evaluation score after learning the features of each image using two networks instead of one [32,43,44,45,46]; or modify the classifier part of the model while freezing the convolutional part that extracts the features of the image [47,48]. In addition, a few studies have analyzed the tourists' urban image by modifying the classifier part of the CNN model in the tourism domain. Zhang, Chen, and Li analyzed the images of tourist attractions using Flickr photos with the ResNet-101 model trained on the Places365 dataset, which classified images into 434 scenes [14,22]. They modified the classifier part of the model by regrouping the 434 scenes into 103 scenes and applied the model to the cities of Beijing and Hong Kong.
These studies, however, have a limitation when both the pre-trained model as is and the re-trained model through transfer learning are applied in tourism domain. In tourism, the unique scenery, cultural properties, and experience activities of the region are the key to the formation of the tourists' urban image, but these studies are not able to properly identify tourism elements or regional characteristics. To analyze tourists' urban image through photos, it is necessary to create a tourism photo classification in consideration of the unique landscape and cultural characteristics of the region. Thus, in this study, we built a tourists' photo classification by analyzing the characteristics of photos posted by tourists and referring to the tourism classification of the Tourism Organization. In addition, we developed a deep learning model to classify a large volume of photos effectively and consistently according to classification criteria.

Research Process
The research flow of this study is shown in Figure 1. First, the photos on Flickr were crawled and divided into photos uploaded by tourists and residents, respectively. Second, tourists' photo classification was developed by analyzing the characteristics of photos posted by tourists and referring to the tourism classification of the Tourism Organization. Third, a deep learning model was developed by continuously re-training the Inception-v3 model. Lastly, the final model was applied to the entire dataset to analyze regions of attraction and tourists' urban image in Korea.

Data Collection and Tourist Identification
Photos on Flickr were collected through a public application programming interface (API) provided by Flickr. The photo collection period was six years, from 1 January 2013 to 31 December 2018, and photos uploaded within Korea were crawled. In total, 284,094 photos were collected, and the number of users was 5609. Since residents and tourists are mingled among Flickr users, it is necessary to identify tourists by excluding residents. To track down each user's country of residence, the photos that each user uploaded anywhere in the world were crawled for the three years preceding the user's last upload. In total, 2,281,800 initial photos were collected worldwide, and the number of users was 5609. After removing the posts deleted by the user and the data with latitude, longitude, or temporal errors, 2,281,586 photos remained, and the number of users was 5384. Of the total 5384 users, 2042 users entered their owner location in their profiles, and 3342 users did not provide their owner location. For the 3342 users, tourists were extracted by tracking down the country of residence by calculating the dates of stay in a specific country, the frequency of visits, and the dates of stay in Korea [49]. As a result of identification, 3259 users were determined to be tourists, and 168,216 photos were extracted.
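The residence-tracking step can be illustrated with a minimal sketch. The exact criteria are those of [49] and are not reproduced here; the rule below (residence is the country with the most distinct photo days) is an assumption made only for illustration, as are the function names `infer_residence` and `is_tourist` and the country code `"KR"`.

```python
from collections import defaultdict
from datetime import date

HOME_COUNTRY = "KR"  # assumed code for Korea; not taken from the paper


def infer_residence(records):
    """Guess a user's country of residence from (country, date) photo records.

    Assumption for this sketch: the country with the most distinct
    photo days is treated as the country of residence.
    """
    days = defaultdict(set)
    for country, day in records:
        days[country].add(day)
    # sorted() makes ties resolve alphabetically, so the result is deterministic
    return max(sorted(days), key=lambda c: len(days[c]))


def is_tourist(records, profile_country=None):
    """Use the profile location if the user provided one, else infer it."""
    residence = profile_country or infer_residence(records)
    return residence != HOME_COUNTRY


records = [
    ("JP", date(2017, 1, 1)), ("JP", date(2017, 1, 2)), ("JP", date(2017, 3, 5)),
    ("KR", date(2018, 5, 1)), ("KR", date(2018, 5, 2)),
]
print(infer_residence(records), is_tourist(records))
```

A rule of this kind would also need thresholds on visit frequency and length of stay in Korea, which the sketch leaves out.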

Classification of Tourists' Photos
To classify tourists' photos, the survey of the Korea Tourism Organization and the tourism categories of the tourism application were referenced. In addition, after manually labeling 30,000 photos (20% of Flickr photos), the characteristics of tourists' photos were identified. Through this process, a draft of the tourists' photo classification was developed and then updated by repeatedly running the Inception-v3 model. Due to the nature of tourists' photos, it was necessary to separate selfie photos, which occur frequently in tourism, as well as indistinct photos that are difficult to classify, such as blurred or enlarged photos.
Through the process of refining the photo classification, the tourists' photos were classified into 75 scenes, including "difficult to classify". Then, the 74 scenes were grouped into 13 categories, as shown in Table 1. In Table 1, the 75 scenes are divided into scenes with strong local characteristics, scenes in which local and general characteristics are mixed, and common scenes that can be applied in any region. There are 35 scenes with strong local and local/general characteristics, representing about 47% of the 75 scenes. Scenes with strong local characteristics are Korean palaces, street food, traditional markets, hanbok experience, traditional performances, etc.

Training a Deep Learning Model for Classifying Tourists' Photos
We aimed to develop a deep learning model through transfer learning of the Inception-v3 model, one of the well-known pre-trained CNN architectures. CNNs are deep neural networks that underpin the state of the art in computer vision for a variety of tasks. Although several models have been released thus far, Inception-v3 is still one of the most accurate models for image classification, achieving a Top 5 error rate of 3.58% and a Top 1 error rate of 17.3% when trained on the ImageNet dataset. In our work, the original network architecture of Inception-v3 was maintained and the weights pre-trained on ImageNet were used to initialize the network. Through fine-tuning, the initialized weights were subsequently updated so that the network could learn the specific features of the new task. The model was modified so that its last softmax layer classifies photos into 75 scenes, as shown in Figure 2.
A CNN has a very large number of parameters to train (millions in the case of Inception-v3), so there is a risk of overfitting when the model is trained with a limited amount of training data. The most common way to reduce overfitting is to use a data augmentation technique that artificially enlarges the training dataset. Data augmentation creates a similar but new image by slightly modifying the input image. One can create a new image by applying techniques such as panning, zooming, rotating, brightness adjustment, horizontal flip, vertical flip, and shearing. Through this, an N-size dataset can be increased to a size of 2N, 3N, 4N, etc. [50,51].
The performance of the re-trained model was evaluated by calculating accuracy, recall, precision, and F1-score after constructing a confusion matrix [52], as shown in Figure 3. The accuracy was calculated over all 75 scenes, whereas the recall and precision were calculated for each of the 75 scenes. Accuracy is the share of all classified photos whose predicted label matches the true label. Recall is, for a given scene, the share of photos truly belonging to that scene that the model predicts correctly. Precision is, for a given scene, the share of photos predicted as that scene that truly belong to it. Recall and precision can be used in a complementary way: the higher these two indices are, the better the model is. However, recall and precision have a trade-off relationship. Thus, the F1-score, which is the harmonic mean of the recall and the precision, is used to evaluate the model performance.
For three classes A, B, and C, with TP denoting the correct predictions for a class and E denoting the misclassifications between a pair of classes, the accuracy is

$$\mathrm{Accuracy} = \frac{TP_A + TP_B + TP_C}{TP_A + TP_B + TP_C + E_{AB} + E_{AC} + E_{BA} + E_{BC} + E_{CA} + E_{CB}} \tag{1}$$
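As a minimal sketch of how these indices follow from a confusion matrix, the function below computes the overall accuracy and the per-scene precision, recall, and F1-score. The matrix layout (`cm[i][j]` counts photos whose true scene is `i` and predicted scene is `j`), the function name, and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
def classification_metrics(cm):
    """Compute accuracy and per-class precision/recall/F1.

    cm[i][j] = number of photos with true class i and predicted class j
    (this row/column convention is an assumption of the sketch).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))  # diagonal = true positives
    accuracy = correct / total

    per_class = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum: truly class k
        predicted = sum(cm[i][k] for i in range(n))  # column sum: predicted k
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, per_class


# Toy three-class matrix (classes A, B, C); the values are made up.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]
accuracy, per_class = classification_metrics(cm)
print(round(accuracy, 4), per_class[0])
```

With three classes the accuracy computed here is the diagonal sum divided by the sum of all cells, which matches the form of Equation (1).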

Spatial Analysis of Tourists' Photos
After selecting the final model from the experiment, we classified the 168,216 photos into 75 scenes and identified the characteristics of tourist visits to Korea by analyzing the dense areas of tourist photos. Two methods were used to analyze dense regions from the tourists' photos. The first method is kernel density estimation, which measures density from the characteristics of the data in the study area and can effectively represent the distribution pattern of points in space [53]. The kernel function, denoted K, weights the point data that fall within a certain radius (bandwidth). In this study, the analysis radius was set to 1 km, and the output grid size was set to 110 m × 110 m. The estimate is expressed in Equation (5):

$$\hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \tag{5}$$

where $\hat{f}(x)$ is the density estimate, K is the kernel function, n is the number of points, h is the bandwidth, d is the data dimensionality, x is an unknown point, and $x_i$ is the ith observation point.
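A minimal sketch of the kernel density estimate for two-dimensional point data follows. The paper does not state which kernel K was used; the quartic (biweight) kernel below is an assumption, chosen because it is zero outside the bandwidth and therefore behaves like a search radius. Coordinates are assumed to be in a projected system with metre units, and the function names are hypothetical.

```python
import math


def quartic_kernel_2d(u_sq):
    """2-D quartic kernel as a function of the squared scaled distance.

    It integrates to 1 over the unit disc and is 0 outside it.
    (Kernel choice is an assumption; the paper does not specify K.)
    """
    if u_sq >= 1.0:
        return 0.0
    return (3.0 / math.pi) * (1.0 - u_sq) ** 2


def kernel_density(points, x, h, d=2):
    """Density estimate at x: sum of K((x - x_i) / h) divided by n * h**d."""
    n = len(points)
    total = 0.0
    for px, py in points:
        u_sq = ((x[0] - px) ** 2 + (x[1] - py) ** 2) / (h * h)
        total += quartic_kernel_2d(u_sq)
    return total / (n * h ** d)


# Three photo locations in metres; bandwidth of 1 km as in the text.
photos = [(0.0, 0.0), (300.0, 400.0), (5000.0, 5000.0)]
print(kernel_density(photos, (0.0, 0.0), h=1000.0))
```

Evaluating `kernel_density` at the centre of every 110 m × 110 m cell would give a density surface of the kind described above; the distant third point contributes nothing because it lies outside the 1 km radius.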
The second method is density-based spatial clustering of applications with noise (DBSCAN), which was used to analyze the dense regions of a specific travel category in detail. The DBSCAN algorithm forms clusters based on the density of data and takes two parameters: the threshold distance eps and the minimum number of data points minPts. The core concept of the algorithm is that data form a cluster if at least minPts data points lie within the threshold distance eps [54]. Because the resulting clusters differ depending on these two values, it is important to find appropriate parameter values. The experiment was conducted with combinations within the range of 150-300 for minPts and 150-500 m for eps. After evaluating whether major tourism destinations were correctly formed, the clusters were derived with eps set to 300 m and minPts set to 200 as the optimal values.
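The clustering rule described above can be sketched in plain Python as follows. This is a small illustrative implementation of the DBSCAN idea, not the software used in the study; it assumes projected coordinates in metres, counts a point as its own neighbour, and uses a brute-force neighbour search that would be too slow for the full dataset.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks unclustered points."""
    labels = [None] * len(points)  # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Two small dense groups and one isolated point (toy values, not study data).
pts = [(0, 0), (1, 0), (0, 1), (1, 1),
       (100, 100), (101, 100), (100, 101), (101, 101),
       (500, 500)]
print(dbscan(pts, eps=2, min_pts=3))
```

With the values reported in the text, the call would take the form `dbscan(points, eps=300, min_pts=200)` on coordinates expressed in metres.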

Setup
In this study, Python 3.6.5 was used for data collection, and Anaconda 3.5.2 and TensorFlow 1.13.0 were used for transfer learning of the model and image classification. The experimental environment for model training and photo classification was the p3.16xlarge instance provided by Amazon Web Services (OS: Ubuntu 16.04; GPU: 8 × NVIDIA Tesla V100, 128 GB of GPU memory in total; 64 vCPUs; 488 GB RAM). QGIS 3.6, ArcPro 2.20, and Python 3.6.5 were used as GIS programs for spatial analysis of the photo data.
Labeling photos is a crucial process in building a training dataset and evaluating model accuracy. We used 168,216 photos in this study, with 60% used in the training phase. It was quite challenging to build a training dataset containing 100,000 photos by labeling them consistently. In this study, we used a semi-supervised labeling method in which a model was built using a small amount of labeled data and then applied to a new dataset to label data automatically [55,56]. This method is suitable for building a training dataset from a large amount of unlabeled data and a small amount of labeled data. Around 20% of the training data were labeled manually to train the model, and the trained model was applied to a new training dataset to label it automatically. Each automatically assigned label was checked by a human as true or false. If it turned out to be false, the true label was attached manually.
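The labeling procedure described above can be sketched as a simple loop. The callables `train` and `human_review`, the function name, and the toy data are hypothetical placeholders: `train` stands in for fitting the classifier on the labeled photos, and `human_review` stands in for the manual true/false check.

```python
def semi_supervised_labeling(seed_labels, unlabeled_ids, train, human_review):
    """Label new photos with a model trained on a small manually labeled seed.

    seed_labels:  dict photo_id -> label (manually labeled photos)
    train:        callable(dict) -> model, where model(photo_id) -> label
    human_review: callable(photo_id, proposed_label) -> correct label
    Returns the full label dict and the number of manual corrections.
    """
    labeled = dict(seed_labels)
    model = train(labeled)
    corrections = 0
    for photo_id in unlabeled_ids:
        proposed = model(photo_id)               # automatic label
        final = human_review(photo_id, proposed) # human confirms or fixes it
        if final != proposed:
            corrections += 1
        labeled[photo_id] = final
    return labeled, corrections


# Toy stand-ins: the "model" always predicts the majority seed label.
truth = {"p1": "palace", "p2": "palace", "p3": "food", "p4": "palace"}
seed = {"p1": "palace"}


def train_majority(labeled):
    labels = list(labeled.values())
    majority = max(set(labels), key=labels.count)
    return lambda photo_id: majority


labeled, corrections = semi_supervised_labeling(
    seed, ["p2", "p3", "p4"], train_majority,
    human_review=lambda pid, proposed: truth[pid],
)
print(labeled, corrections)
```

The benefit of such a loop is that a reviewer only has to correct the wrong proposals rather than label every photo from scratch.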

Transfer Learning of Inception-v3 Model
All photos were divided into 60% training data, 20% validation data, and 20% test data to train and evaluate the model. First, we selected representative photos for each of the 75 scenes to build a training dataset. We built the first training dataset by extracting 50 photos for each scene after manually labeling 20% of the training dataset. After the model was trained on 50 photos per scene, it was applied to other training data to check its accuracy. From this, the number of photos in the training dataset was gradually increased to improve the accuracy of the model. The training dataset started from 50 photos per scene and increased to 300 photos per scene, as shown in Figure 4. As the number of photos per scene increased, accuracy improved from 66.99% to 84.23%, as shown in Table 2. The overall accuracy no longer improved and remained similar when the number of photos increased from 200 to 300 per scene. However, the deviation of the accuracy among scenes was smaller with 300 photos per scene. Thus, 300 photos per scene were used for the training dataset. A representative photo of each scene is illustrated in Figure 5.
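A 60/20/20 split of this kind can be sketched as below. The shuffling, the seed, and the function name are assumptions of the sketch; the paper does not describe how photos were assigned to each subset.

```python
import random


def split_dataset(items, train_pct=60, val_pct=20, seed=0):
    """Shuffle items and split them into train/validation/test lists.

    Integer percentages avoid floating-point rounding surprises;
    the test set receives whatever remains after train and validation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


photo_ids = list(range(100))  # placeholder IDs, not real Flickr IDs
train, val, test = split_dataset(photo_ids)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets stable across runs, so that validation and test photos never leak into training when the model is re-trained.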
Several considerations exist when building a training dataset. First, we built the training dataset with only photos that clearly contained the characteristics of each scene and, if needed, used the part of the photo that showed the features of the scene rather than the entire photo. Second, we cross-checked the photos by scene so that similar photos were not included in different scenes. Third, we equalized the number of photos per scene although the number of available photos per scene varied. Fourth, photos released as open data, such as Google photo, were added when it was difficult to find representative photos in the collected data. Fifth, indistinct photos were classified into the "difficult to classify" scene, as shown in Figure 6.
An experiment was conducted to determine whether data augmentation was necessary to improve the model performance after setting the number of training photos to 300 per scene. The data augmentation-related experiment aimed to review which effects could be used to increase the number of photos and how many times the number of photos should be increased. Zooming, rotation, brightness, horizontal flip, and width shift were used as photo effects. In this study, zooming was set to 0.85~1.15, rotation to 10, brightness to 0.5~1.5, horizontal flip to true, and width shift to 0.15. An example of photo effects is shown in Figure 7.
Regarding data augmentation, classification accuracy was checked while gradually increasing the number of photos, as shown in Table 3. Case 1 was created with the original training dataset, 22,384 photos, without applying data augmentation. Cases 2-5 were created by increasing the size of the original training dataset by 2-5 times, respectively. The hyper-parameters used in the model were set to Adam for the optimizer, 0.0001 for the learning rate, and 128 for the batch size. As shown in Table 3, classification accuracy improved as the number of photos increased.
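To make the augmentation settings concrete, the sketch below draws random effect parameters within the ranges reported above and counts how many extra photos are needed for a given multiplication factor. It only samples parameters and does not transform images. Interpreting "rotation 10" as up to ±10 degrees and "width shift 0.15" as up to ±15% of the image width is an assumption, as are all names in the code.

```python
import random

# Ranges taken from the text; the units noted below are assumptions.
AUGMENTATION_RANGES = {
    "zoom": (0.85, 1.15),
    "rotation_deg": (-10.0, 10.0),  # assumes "10" means up to +/-10 degrees
    "brightness": (0.5, 1.5),
    "width_shift": (-0.15, 0.15),   # assumes a fraction of the image width
}


def sample_augmentation(rng):
    """Draw one random set of effect parameters for a single new photo."""
    params = {name: rng.uniform(lo, hi)
              for name, (lo, hi) in AUGMENTATION_RANGES.items()}
    params["horizontal_flip"] = rng.random() < 0.5
    return params


def augmentation_plan(n_original, factor, seed=0):
    """One parameter set per extra photo needed to reach factor * n_original."""
    rng = random.Random(seed)
    return [sample_augmentation(rng) for _ in range(n_original * (factor - 1))]


plan = augmentation_plan(n_original=10, factor=4)
print(len(plan), plan[0])
```

Under this reading, a fourfold case keeps the original photos and adds three augmented variants per original.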
Classification accuracy by case was evaluated with the validation data, as shown in Table 4. For accuracy evaluation, 33,643 photos were labeled as the validation dataset, and the Top 1 accuracy, Top 5 accuracy, recall, precision, and F1-score were calculated.
The Top 1 accuracy is the share of photos for which the most probable label predicted by the model matches the true label. The Top 5 accuracy is the share of photos for which any one of the five most probable labels predicted by the model matches the true label. For the Top 1 accuracy, Case 1, without data augmentation, was the highest at 73.51%, and for recall, Case 5, which increased the number of original photos fivefold, was the highest at 0.7631. On the other hand, the Top 5 accuracy, precision, and F1-score showed the best performance in Case 4, which increased the number of original photos fourfold. Therefore, Case 4 was selected as the final model.
For accuracy evaluation of the final model, 32,682 photos were used as the test dataset, obtained by removing the 510 photos in the "difficult-to-classify" scene from the 33,192 total photos. The final model showed a Top 1 accuracy of 85.77%, a Top 5 accuracy of 95.69%, and an F1-score of 0.8485, as shown in Table 5. The performance of the final model was reasonably good compared with that of the Inception-v3 model on the ImageNet dataset, which showed 82.7% for Top 1 accuracy and 96.42% for Top 5 accuracy. The training dataset for the 75 scenes and the source code of the final model constructed in this study are publicly available on the website: https://github.com/ewha-gis/Korea-Tourists-Urban-Image (accessed on 28 December 2020).
Figure 8 shows the precision, recall, and F1-score by scene. Based on F1-score, the "bike" scene was highest at 0.9707, followed by "cat" at 0.9699, "eaves" at 0.9697, "airplane" at 0.9667, and "food" at 0.9488. On the contrary, the lowest-performing scene was "amusement park" at 0.6056, followed by "lantern and altar" at 0.6431, "war memorial" at 0.6684, "lantern fireworks festival" at 0.7164, and "view" at 0.7285.
These results indicate that the scenes that were clearly recognized by the object or highly differentiated from other scenes could be well classified, whereas the scenes with various objects could be somewhat poorly classified.
ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 13 of 21
Figure 8. Precision, recall, and F1-score by scene.
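The evaluation metrics used in this section (Top 1 and Top 5 accuracy, and per-scene precision, recall, and F1-score) can be illustrated with a short sketch. It is written in plain Python under the assumption that the model outputs one probability per scene for each photo; the function names and the toy data are hypothetical and are not taken from the published source code.

```python
def top_k_accuracy(probabilities, true_labels, k):
    """Fraction of samples whose true label is among the k most probable labels."""
    hits = 0
    for scores, truth in zip(probabilities, true_labels):
        # Class indices sorted from most to least probable.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


def precision_recall_f1(predicted, true_labels, cls):
    """Precision, recall, and F1-score for one class (one scene)."""
    pairs = list(zip(predicted, true_labels))
    tp = sum(1 for p, t in pairs if p == cls and t == cls)
    fp = sum(1 for p, t in pairs if p == cls and t != cls)
    fn = sum(1 for p, t in pairs if p != cls and t == cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy example: three photos, three scenes (illustrative values only).
    probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
    truth = [0, 1, 0]
    top1_predictions = [max(range(len(p)), key=lambda c: p[c]) for p in probs]
    print(top_k_accuracy(probs, truth, 1))
    print(top_k_accuracy(probs, truth, 2))
    print(precision_recall_f1(top1_predictions, truth, 0))
```

The Top 5 accuracy is the same computation with `k=5`; with 75 scenes it is always at least as high as the Top 1 accuracy, which is consistent with the figures reported for the final model.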

Spatial Analysis of Tourists' Photos
By applying the final model to the entire dataset, the tourists' urban images were explored in more detail by narrowing the scope of analysis from Korea to Seoul. Seoul is the capital and largest city of South Korea, mingling unique cultural heritage such as well-preserved royal palaces and Buddhist temples with modern landscapes such as skyscrapers, shopping malls, and K-pop entertainment. Major attractions in Seoul are shown in Figure 9. With respect to the volume of data, 2264 tourists, representing 69.5% of the total 3259 tourists, visited Seoul, and they posted 80,553 photos, which is 47.9% of the total 168,216 photos.
The results of classifying the collected 80,553 photos with the final model are shown in Figures 10 and 11, which present the percentages of the 74 scenes and 12 categories in descending order. The most frequent scenes in Seoul are "selfies and people", "food", "palace", "conference", and "building", and the most frequent categories are "Urban scenery", "Korean traditional architecture", "Food and Beverage", "Shopping", and "Activities". It can be seen that tourists like to take selfies in exotic landscapes, enjoy local food, visit authentic traditional palaces, and see the distinctive cityscape that can be enjoyed only in Seoul.

The regions where many photos are posted can be recognized as attractive tourist destinations. Figure 12 shows a dot map and a kernel density map using the location information of the photos. The kernel density map shows that the photos posted by tourists are concentrated in the downtown area of Seoul. However, the clustered areas in Seoul differ by category. Figure 13 shows the kernel density maps obtained by grouping the 74 scenes into 12 categories. The kernel density maps can be classified into three types. The first type is one in which a single distinct core region appears and the spread to other regions is weak. The "shopping", "Korean traditional architecture", and "information and symbol" categories belong to this type. For example, Myeong-dong is a hot spot for shopping, while Gyeongbokgung Palace and Gwanghwamun Gate are the busiest places for Korean traditional architecture.
The second type is one in which small dense areas are scattered in various places in addition to the city center. The "food and beverage", "people", "culture and relics", and "traffic" categories belong to this type. For the "food and beverage" and "people" categories, small frequently visited areas can be found around Shinchon-Hongdae, Itaewon, and Garosu-gil in Gangnam. For the "culture and relics" category, small frequently visited areas are found at Namsan Tower, as well as at Gyeongbokgung Palace, Gwanghwamun Gate, and Jongro, the areas surrounding the War Memorial Museum, and the Jamsil area. For "traffic", small frequently visited areas are found at Yongsan Station, Seoul Station, and various other places.
The third type is one in which the density of the city center is relatively weak and the dense areas are somewhat dispersed. The "activities", "accommodations and conferences", "animals", and "natural landscapes" categories belong to this type. For "activities", frequently visited places are found in the Namsan Tower and Jamsil areas, along with the Gyeongbokgung Palace and Gwanghwamun Gate area. For the "accommodations and conferences" category, frequently visited places are found in the City Hall and surrounding area, Dongdaemun, Yeouido, and COEX. For the "animals" category, frequently visited places are the Children's Grand Park area and cat cafés in Gangseo-gu, in addition to urban places such as Myeong-dong. The "natural landscapes" category shows the most sporadic pattern of frequently visited places. Many photos are taken in the Gyeongbokgung and Changdeokgung areas and in the parks surrounding Namsan Tower, Seoul Forest Park, Bukhansan, Dobongsan, and Gwanaksan.
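To make the kernel density maps discussed above more concrete, the following is a minimal sketch of a two-dimensional kernel density estimate evaluated on a regular grid. The text does not report the kernel function or the bandwidth, so the Gaussian kernel, the bandwidth value, and the toy coordinates below are all assumptions chosen for illustration; the function name is hypothetical.

```python
import math


def gaussian_kde_grid(points, xs, ys, bandwidth):
    """Evaluate a 2-D Gaussian kernel density estimate on a grid.

    points    : list of (x, y) photo locations in planar coordinates
    xs, ys    : grid coordinates along each axis
    bandwidth : kernel standard deviation, in the same units as the points
    Returns a list of rows, one row per y value.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for y in ys:
        row = []
        for x in xs:
            total = sum(
                math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        grid.append(row)
    return grid


if __name__ == "__main__":
    # Toy data: three photos near the origin and one isolated photo.
    photos = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.1), (3.0, 3.0)]
    axis = [0.0, 1.0, 2.0, 3.0]
    density = gaussian_kde_grid(photos, axis, axis, bandwidth=0.5)
    for row in density:
        print(["%.3f" % value for value in row])
```

In this toy example the grid cell at the origin receives the highest density, which corresponds to a "distinct core region" in the terminology used above. A larger bandwidth would smooth the map and merge nearby hot spots.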
The urban images conveyed by tourists' photos can be analyzed in more detail by applying the DBSCAN method to a single category, for example the "Activities" category. Figure 14 shows six regions of attraction and representative photos. Figure 15 shows the popular activities in the six regions of attraction: a traditional changing-of-the-guard performance and a winter lantern festival at Seoul City Hall, a traditional performance and hanbok experience at Gyeongbokgung Palace, love locks at Namsan Seoul Tower, various stage performances including K-pop at Jamsil Sports Complex, a theme park at Lotte World, and a hanbok experience at Namdaemun Market.
Figure 14. Six regions of attractions for "Activities" category.
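The DBSCAN step used to delineate regions of attraction can be sketched as follows. This is a minimal pure-Python version of the algorithm for illustration only: the text does not report the `eps` and `min_samples` values, so the ones below are hypothetical, and the coordinates are assumed to be planar (projected) so that Euclidean distance is meaningful.

```python
def dbscan(points, eps, min_samples):
    """Cluster 2-D points with DBSCAN; returns one label per point (-1 = noise)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        xi, yi = points[i]
        return [
            j for j, (xj, yj) in enumerate(points)
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2
        ]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


if __name__ == "__main__":
    # Toy photo locations: two dense groups and one isolated photo.
    photos = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(photos, eps=1.5, min_samples=3))
```

Each non-negative label corresponds to one candidate region of attraction, and photos labeled -1 are treated as noise. In practice a library implementation such as scikit-learn's `DBSCAN` would usually be preferred for large photo collections.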

Discussion and Conclusions
In the tourism field, a few studies have emerged that analyze the tourists' urban image using pre-trained deep learning models such as ResNet or Inception-v3. When photos are classified using the ResNet model trained on the Places365 dataset with 434 categories or the Inception-v3 model trained on the ImageNet dataset with 1000 categories, the classification results keep the categories of the training dataset, which do not properly reflect regional characteristics. Kim et al. pointed out that the overall accuracy was only 27.93% when each predicted label was checked as "true" or "false" after classifying 38,891 photos in Seoul using the Inception-v3 model trained on the ImageNet dataset [24]. Figure 16 shows how the tourism scenes in our study are classified into the scenes of Places365 on the website http://places2.csail.mit.edu/index.html (accessed on 28 December 2020). The number represents the probability of being classified into that scene. Figure 16 shows that the "hanbok experience" scenes are wrongly classified as temple or water, and photos taken in the "love lock" scenes as playground or shoe shop. These kinds of misclassifications are evident in the scenes that can be uniquely observed in Korea, such as "street food", "traditional market", "traditional performance", "mural and trick art", "lantern fireworks festival", "lantern and altar", etc. Thus, it is essential to develop a tourists' photo classification suitable for local characteristics and to classify photos accordingly.
The novelty of this study is that it developed a tourist photo classification suitable for local characteristics and showed the process of re-training a deep learning model to effectively classify tourism photos. For the tourists' photo classification, we manually labeled 30,000 photos (20% of the Flickr photos) and analyzed the characteristics of the photos by referring to the survey of the Korea Tourism Organization and the tourism categories of the tourism application. A draft of the tourists' photo classification was developed and updated by repeatedly running the Inception-v3 model. Finally, through this comprehensive process of refining the photo classification, the tourists' photos were classified into 75 scenes. There are 35 scenes with strong local and local/general characteristics, representing about 47% of the total 75 scenes. For the process of re-training a deep learning model, we created a "difficult-to-classify" category, applied a semi-supervised labeling method, selected representative photos, and performed data augmentation to improve the classification accuracy of the model. In addition, we not only adjusted the classifier part to 75 output classes, which is common in transfer learning for a deep learning model, but also updated all weights of the feature extraction part, which requires a lot of effort and creativity. As a result, our final model shows a Top 1 accuracy of 85.77% and a Top 5 accuracy of 95.69%. The performance of our final model is reasonably good compared with the performance of the Inception-v3 model on the ImageNet dataset, which showed 82.7% for Top 1 accuracy and 96.42% for Top 5 accuracy. The detailed re-training process presented in this study can serve as a guideline for the analysis of tourists' urban image through photo classification in other regions in the future. In addition, this study is meaningful in that it provides a practical method for classifying diverse and complex photos in urban or regional studies.
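The re-training setup summarized above (a 75-way classifier on top of Inception-v3, with all feature-extraction weights updated) could look roughly like the sketch below. It is an illustration under stated assumptions, not the published implementation: it assumes TensorFlow 2.x with Keras, ImageNet initial weights, a 299 x 299 input, global average pooling before the classifier, and a categorical cross-entropy loss, none of which are specified in the text. Only the number of scenes, the optimizer, the learning rate, and the batch size come from the paper. The function name is hypothetical.

```python
NUM_SCENES = 75        # number of tourist photo scenes (from the text)
LEARNING_RATE = 1e-4   # Adam learning rate (from the text)
BATCH_SIZE = 128       # batch size (from the text)


def build_finetune_model(num_classes=NUM_SCENES, weights="imagenet"):
    """Build an Inception-v3 model whose weights are all trainable.

    TensorFlow is imported lazily so that the constants above can be
    inspected without it. Pass weights=None to skip the ImageNet download.
    """
    import tensorflow as tf

    base = tf.keras.applications.InceptionV3(
        include_top=False,           # drop the original 1000-class classifier
        weights=weights,
        input_shape=(299, 299, 3),   # assumed input size
        pooling="avg",               # assumed pooling before the new classifier
    )
    base.trainable = True            # update the feature-extraction weights too

    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss="categorical_crossentropy",   # assumed loss for one-hot labels
        metrics=[
            "accuracy",
            tf.keras.metrics.TopKCategoricalAccuracy(k=5, name="top5_accuracy"),
        ],
    )
    return model


if __name__ == "__main__":
    model = build_finetune_model(weights=None)
    print(model.output_shape)
```

Setting `base.trainable = False` instead would give the more common transfer-learning variant in which only the new classifier layer is trained; the text states that the authors went further and updated all weights.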
However, further studies are needed in the future. It is recommended to develop a deep learning model that can assign multiple labels to a photo or a hybrid deep learning model that can consider text data such as tags and titles in addition to location and photo