Korean Tourist Spot Multi-Modal Dataset for Deep Learning Applications

Abstract: Recently, deep learning-based methods for solving multi-modal tasks such as image captioning, multi-modal classification, and cross-modal retrieval have attracted much attention. Applying deep learning to such tasks requires large amounts of training data. However, although there are several Korean single-modal datasets, there are not enough Korean multi-modal datasets. In this paper, we introduce the KTS (Korean tourist spot) dataset for Korean multi-modal deep-learning research. The KTS dataset has four modalities (image, text, hashtags, and likes) and consists of 10 classes related to Korean tourist spots. All data were extracted from Instagram and preprocessed. We performed two experiments with the dataset, image classification and image captioning, and both showed appropriate results. We hope that many researchers will use this dataset for multi-modal deep-learning research.

The data were collected from a social network service (Instagram), and each data instance consists of the image, text, hashtags, and likes of a post. The dataset has 10 classes of 1000 instances each (10,000 instances in total).
The KTS dataset can be used not only for simple image classification or sentiment analysis of the text data, but also for various multi-modal tasks, such as image captioning, multi-modal classification, and recommendation-system simulation. In the experimental part of this study (see Section 4), we conducted two simple experiments: image classification and image captioning. The results show that meaningful performance can be achieved with this dataset using general deep-learning methods.

Data Description
Instagram is a photo- or video-sharing social networking service owned by Facebook [13]. Instagram posts include images, texts, hashtags, likes, user IDs, and other users' comments. We extracted the images, texts, hashtags, and likes from these elements using a web-scraping technique, and sensitive information (e.g., user IDs and post URLs) was removed. The KTS dataset has 10,000 instances collected from posts related to Korean tourist spots uploaded to Instagram. Table 1 shows a schematic of the dataset. For example, the first row shows an instance of the "beach" sub-class that contains an image; the texts "Is this real life?? Real-time Udo. Jeju-do is awesome. The sea color is also beautiful"; hashtags meaning "#travel", "#Udo (island)", and "#Hagosudong beach"; and the likes count.

Class Structure
Table 2 shows the class structure. The super-class level is divided into "person-made" and "nature-scene" for the tourist-spot domain, and each has five sub-classes: amusement park, palace, park, restaurant, and tower for person-made, and beach, cave, island, lake, and mountain for nature-scene. There are 1000 instances (image, text, hashtags, and likes) for each sub-class.

Data Structure
The dataset is provided in two versions: a total version and a split version. The split version is divided into train, validation, and test sets in a 7:1:2 ratio. The total version contains all the data, allowing users to divide the dataset into any desired ratio; we also provide code to split the dataset. Each of the four folders (total, train, valid, and test) consists of the two super-classes and ten sub-classes, following the class structure presented in Table 2. Each class has an image folder that contains the image data and a json file that includes the text, hashtags, and likes in json format. Figure 1 shows an example of the data structure of the KTS dataset. For instance, the first picture shows the 2nd image of the mountain class in the total folder. The json file contains the data, such as the text and likes, that form a pair with this image. The "text" field holds the texts extracted from the post, "label" is the name of the instance's sub-class, "hashtag" holds the hashtags of the post, "img_name" is the name of the image file stored in the image folder, and "likes" is the number of likes of the post at the time of data collection. This structure allows users to load a json file and an image file together.
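For illustration, the following is a minimal sketch of loading one instance, i.e., the paired json record and image. The field names ("text", "label", "hashtag", "img_name", and "likes") are those described above; the folder and file names are assumptions based on Figure 1.

```python
import json
from pathlib import Path

from PIL import Image  # pip install Pillow

# Hypothetical paths; the actual layout follows Figure 1:
# <root>/<version>/<super-class>/<sub-class>/ with an image folder and a json file.
ROOT = Path("kts_dataset/total/nature-scene/mountain")

# The json file holds one record per instance with the fields described above.
with open(ROOT / "mountain.json", encoding="utf-8") as f:  # file name assumed
    records = json.load(f)

record = records[1]  # e.g., the 2nd instance of the mountain class
image = Image.open(ROOT / "images" / record["img_name"])  # image folder name assumed

print(record["label"], record["likes"])  # sub-class name and like count
print(record["text"])                    # Korean caption text
print(record["hashtag"])                 # list of hashtags
```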

Images
Every image in each sub-class is saved in jpg format and numbered from 1 to 1000, one per data instance. The dataset is composed of images that represent each class well, and image classification can be performed using the image data alone; we conducted such experiments, and the results are described in Section 4.1. Since the images posted on Instagram are stored without any modification, the image sizes vary. Because the images are related to tourist spots, there are many components in the images, including people, but no image clearly shows a face recognizable as a specific person. Figure 2 shows examples of the images.

Texts
Since the texts are extracted from Instagram posts written by Koreans, they are mostly in Korean, and some special symbols (., !, ?, @, +, *, etc.) and numbers are also included. The personal information of the user and of others in the post is not included. The names of persons that must appear in the context are replaced with "철수" for males and "영희" for females, which are very common Korean names, like "Jack" and "Jill" in English. In addition, the grammar and spelling errors commonly found in Korean Instagram posts are retained without correction. Table 3 shows examples of the texts. For instance, the text for the restaurant sub-class is "Crispy egg tart is delicious."

Hashtags and Likes
The hashtags are in Korean and English, and they are closely related to the content of the posts. "Likes" refers to the number of likes of a post at the time of data collection. Due to the nature of Instagram, the number of hashtags in a post varies from 0 to tens, and the number of likes varies from 0 to thousands. Figure 3 shows the distributions of hashtags and likes in the dataset.

We trained and visualized distributed representations of the hashtag words using the Word2Vec model [14]. Figure 4a shows a two-dimensional visualization of the hashtag word vectors learned for each of the nature-scene and person-made classes. The Word2Vec visualization shows that the hashtags have somewhat separate distributions for each class. We also visualized the frequency of hashtags using the WordClouds application [15] to verify whether the hashtags consist of words appropriate for each super-class. Figure 4b and Figure 4c show the WordClouds results, visualizing the hashtags frequent in the person-made and nature-scene classes, respectively. The larger the word, the more often it appears in the dataset. For example, in Figure 4b there are many words related to person-made classes, such as "서울 (Seoul)", "궁궐 (palace)", and "식당 (restaurant)", and in Figure 4c there are many words related to nature-scene classes, such as "바다 (sea)", "제주 (Jeju)", and "풍경 (scene)".
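As a minimal sketch of this analysis, assuming gensim for Word2Vec and a PCA projection for the 2D plot (the paper does not state the embedding size, window, or projection method):

```python
from gensim.models import Word2Vec  # gensim >= 4.0
from sklearn.decomposition import PCA

# Toy per-post hashtag lists standing in for the hashtags in the dataset.
hashtag_lists = [
    ["여행", "우도", "하고수동해변"],  # nature-scene post
    ["서울", "궁궐", "식당"],          # person-made post
]

# Train a small skip-gram model on the hashtag "sentences"; all
# hyperparameters here are placeholders.
model = Word2Vec(hashtag_lists, vector_size=100, window=5, min_count=1, sg=1)

# Project the learned vectors to 2D for a Figure 4a-style visualization.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```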

Methods
As described in Table 2, we divided the class structure into two super-classes (person-made and nature-scene) and five sub-classes for each super-class. To deal with the tourist-spot domain, we designed the collection and preprocessing procedures so that the sub-class data cover the tourist-spot domain as well. Data collection and preprocessing took two months, from January to February 2019, and were implemented in Python [16]. In addition, we used the BeautifulSoup package [17] for data collection.

Data Collection
The posts were obtained from Instagram using queries related to the sub-classes. For example, to collect data for the sub-class "beach", we used specific tourist-spot beach names, such as "경포해수욕장 (Gyeongpo beach)" and "해운대 (Hawoondae)", as queries. Then, we extracted each post's information from the HTML code of the post. After extracting the information, the images were saved as jpg files, and the texts, hashtags, likes, user IDs, comments, and post URLs were saved as json files. If multiple images were registered in a post, only the first image shown at the front was collected.
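Since the paper names only Python [16] and BeautifulSoup [17] for this step, the following is a minimal parsing sketch over an already-saved post page. The tag names and class attributes are hypothetical: Instagram's actual markup is not documented here and changes over time, so a working scraper would need selectors matched to the pages actually fetched.

```python
import json

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def parse_post(html: str) -> dict:
    """Extract the fields kept in the KTS dataset from one post's HTML.

    The selectors below are placeholders, not Instagram's real markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    text_node = soup.find("div", class_="caption")        # hypothetical selector
    likes_node = soup.find("span", class_="like-count")   # hypothetical selector
    caption = text_node.get_text(" ", strip=True) if text_node else ""
    words = caption.split()
    return {
        "text": " ".join(w for w in words if not w.startswith("#")),
        "hashtag": [w for w in words if w.startswith("#")],
        "likes": int(likes_node.get_text(strip=True)) if likes_node else 0,
    }

with open("post.html", encoding="utf-8") as f:  # page saved by the crawler
    print(json.dumps(parse_post(f.read()), ensure_ascii=False, indent=2))
```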

Images
From all the collected data, we selected only the instances whose images are related to Korean tourist spots and the sub-classes. We also excluded instances whose images contain sensitive information, such as a face recognizable as a specific person or personal information such as a phone number. In consideration of ease of use, instances whose images have framed decorations or solid white or black margins at the edges were removed. Instances with problematic content, such as copyrighted images with logos, were also excluded. All these steps were done manually, and the images were not resized or cropped.

Texts and Hashtags
The texts and hashtags of the 10,000 instances obtained through the image preprocessing were then refined. We removed emojis while keeping some special symbols, and characters such as Chinese or Japanese were removed or translated. In addition, as with the images, information that can identify an individual, such as names, user IDs, and phone numbers, was removed or modified. Comments and post URLs were also removed from the instances.
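The exact character set that was stripped is not specified, so the following sketch shows one common way to remove emojis with a regular expression while keeping the special symbols mentioned above; the Unicode ranges are an approximation.

```python
import re

# Ranges covering common emoji blocks; an approximation of the cleanup rule.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U00002700-\U000027BF"  # dingbats
    "]+"
)

KEPT_SYMBOLS = ".,!?@+*#"  # special symbols the dataset retains

def clean_text(text: str) -> str:
    text = EMOJI_PATTERN.sub("", text)
    # Keep word characters (including Hangul), whitespace, and kept symbols.
    text = re.sub(rf"[^\w\s{re.escape(KEPT_SYMBOLS)}]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("경치가 너무 좋다 😍🌊 #여행 !!"))  # -> "경치가 너무 좋다 #여행 !!"
```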

Experiments
We conducted two simple experiments to verify that the data were collected appropriately. The first experiment was image classification using some recent convolutional neural networks [18], described in Section 4.1. In the second experiment, a simple image-captioning task [19] was performed using the images and texts, as described in Section 4.2.

Image Classification Using DCNN (Deep Convolutional Neural Networks)
To verify that the images were adequately collected, we fine-tuned several deep CNN models on the images. The selected CNN models were VGG16 [20], ResNet18 [21], and DenseNet121 [22], and the hyperparameters were set as follows: momentum SGD [23] as the optimizer, a learning rate of 0.001 (with scheduling), and a batch size of 4. All experiments were performed with the same settings. Since CIFAR-10 [2] also has 10 classes, we fine-tuned the same models on the CIFAR-10 dataset and compared the performances. Table 4 shows the Top-1 accuracy for each model. As described in Table 4, the deep CNN models show good performance on the KTS dataset, as on CIFAR-10. Higher performance could be achieved by fine-tuning VGG or ResNet on CIFAR-10 more carefully, but in this experiment we trained all the models with the same hyperparameter settings. The images of the KTS dataset can thus be used in a single-modal experiment such as image classification; however, the dataset is fundamentally multi-modal, with multiple single-modal data making up one instance, so it can also be used for more complex tasks.
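As an illustration, the following is a minimal fine-tuning sketch in PyTorch (the framework is an assumption; the paper states only that Python was used). It mirrors the stated setup of momentum SGD, a learning rate of 0.001 with scheduling, and a batch size of 4; the data loaders and the exact schedule are assumptions.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torchvision import models  # torchvision >= 0.13 for the weights API

NUM_CLASSES = 10  # the 10 KTS sub-classes

# Start from an ImageNet-pretrained backbone and replace the classifier head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = StepLR(optimizer, step_size=7, gamma=0.1)  # schedule details assumed
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    """One pass over a DataLoader yielding (images, labels) with batch size 4."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```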
Image Captioning

We conducted a simple image-captioning experiment using the images and texts in the dataset. The goal of image captioning is to convert a given image into a text description. Usually, an encoder-decoder framework is used for this task: the encoder is a CNN model, and the decoder is a recurrent model such as an LSTM or GRU [24,25]. In this experiment, we used a pre-trained DenseNet152 encoder and an LSTM decoder [22,24], and the hyperparameters were set as follows: Adam [26] as the optimizer, a learning rate of 0.001, and a batch size of 128. Table 5 shows several samples of the test results for image captioning.

In Table 5, the image in the first column is from the sub-class "beach". The ground-truth text, shown in the second row, translates to "The weather is nice, Gyeongpodae". The caption generated by the neural network for this test image, shown in the third row, translates into English as "The winter sea is so good. I like traveling alone". The vocabulary we built for image captioning has 23,147 words. In this experiment, we trained the model using a cross-entropy loss to reduce the differences between the target sentences and the predicted sentences. We also measured perplexity [27], which is a simple way of evaluating language models. Figure 5 shows that the training loss and perplexity decrease with each epoch of the image-captioning experiment.
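The following is a minimal encoder-decoder sketch of this setup. Torchvision provides DenseNet-121/169/201 but no DenseNet-152, so the sketch substitutes a DenseNet-121 backbone for the encoder; the embedding and hidden sizes are assumptions, and one training step is indicated in comments.

```python
import torch
import torch.nn as nn
from torchvision import models

class EncoderCNN(nn.Module):
    """Pretrained CNN backbone mapping an image to an embedding vector."""
    def __init__(self, embed_size: int):
        super().__init__()
        backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        self.features = backbone.features
        self.fc = nn.Linear(backbone.classifier.in_features, embed_size)

    def forward(self, images):
        x = torch.relu(self.features(images))
        x = nn.functional.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.fc(x)

class DecoderRNN(nn.Module):
    """LSTM producing caption logits conditioned on the image embedding."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image embedding as the first step of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)

VOCAB_SIZE = 23147  # vocabulary built from the KTS texts
encoder, decoder = EncoderCNN(256), DecoderRNN(256, 512, VOCAB_SIZE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.001
)

# One training step (images: (B, 3, H, W); captions: (B, T) token ids):
#   logits = decoder(encoder(images), captions[:, :-1])   # (B, T, VOCAB_SIZE)
#   loss = criterion(logits.reshape(-1, VOCAB_SIZE), captions.reshape(-1))
#   perplexity = torch.exp(loss)  # the quantity plotted in Figure 5b
```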

Conclusions and Future Work

We created the KTS dataset for multi-modal tasks in the field of machine learning. The KTS dataset was designed for research with Korean texts, and it consists of the images, texts, hashtags, and likes of Instagram posts on Korean tourist spots. The dataset can be used to perform a variety of multi-modal tasks, such as image captioning, multi-modal classification, and recommendation-system simulation.

Figure 1. An example of the data structure.

Figure 2. Examples of images.

Figure 3. The distributions of hashtags (a) and likes (b). The x-axes represent the number of hashtags and likes, respectively, and the y-axes represent the number of instances.

Figure 4. A 2D visualization of hashtag vectors trained by Word2Vec (a), and WordClouds of hashtags in the person-made classes (b) and nature-scene classes (c).

Figure 5. The loss (a) and perplexity (b) graphs for image captioning.

Table 1. The overall schematic of the Korean tourist spot (KTS) dataset.

Table 2. The class structure.

Table 3. Examples of texts.

Table 4. Top-1 classification accuracies of deep CNN models.

Table 5. Sample test results for image captioning.
