When BERT Started Traveling: TourBERT—A Natural Language Processing Model for the Travel Industry

Abstract: In recent years, natural language processing (NLP) has become increasingly important for extracting new insights from unstructured text data, and pre-trained language models now achieve state-of-the-art performance on tasks such as topic modeling, text classification, and sentiment analysis. Currently, BERT is the most widely used model, but it has been shown that BERT can be further optimized for domain-specific contexts. While a number of BERT models that improve downstream task performance for other domains already exist, an optimized BERT model for tourism has yet to be released. This study thus aimed to develop and evaluate TourBERT, a pre-trained BERT model for the tourism industry. It was trained from scratch and outperforms BERT-Base in all tourism-specific evaluations. This study therefore contributes to the growing importance of NLP in tourism by providing an open-source BERT model adapted to tourism requirements and particularities.


Introduction
Tourism products and services tend to be highly descriptive [1] as they cannot be tested in advance. In addition, tourism services are co-created with the customer and are relatively expensive compared to everyday products. As a result, the descriptions of products and services tend to be very extensive and text-heavy. Alongside detailed descriptions from the supply side, user-generated content (UGC) continues to gain relevance [2]. Whether on review platforms, such as TripAdvisor, or social media channels, such as Twitter, Facebook, or Instagram, individuals are constantly sharing their travel experiences and, in turn, influencing other users [3]. This content is of particular importance for tourism providers as they seem to be losing their power to UGC [4]. Therefore, to better understand consumer behavior and adapt marketing initiatives accordingly, the automated analysis of texts using NLP methods is becoming increasingly important for both academia and the tourism industry [5]. At the same time, more powerful language models are emerging, enabling more advanced text analyses to be conducted.
BERT, developed by Google, is considered one of the most powerful and widely used language models. On the one hand, this pre-trained language model has been trained on a huge generic corpus and can be used universally. On the other hand, it has weaknesses when it comes to domain-specific applications. Therefore, this paper aims to develop and evaluate a domain-specific BERT model for tourism. The proposed TourBERT model was pre-trained from scratch using 3.6 million tourist reviews and 46,000 descriptions of tourist services, attractions, and sights from more than 20 different countries around the world. This study makes a unique contribution to the extant body of natural language models and tourism research, as the evaluation of TourBERT has shown its superiority to BERT in all tasks concerning tourism-relevant content. TourBERT can be regarded as the state-of-the-art language model for the tourism industry and for academic text analytics alike, since the pre-trained model can be fine-tuned to perform numerous tasks such as text representation, text classification and clustering, topic modeling, sentiment analysis, or question answering.

Literature Review
With an increase in computational power and more effective and efficient algorithms, abundant research has been conducted in recent years, both within academia and the tourism industry, on how to best process textual data. According to Wennker [6], 80% of all data that is produced is text-based, which underscores Poon's [7] statement that "information is the lifeblood of tourism." Especially since the rise of UGC, a vast amount of unstructured text has become available at one's disposal, the analysis of which can provide important insights into tourists and their wants, needs, and experiences that are highly relevant for tourism marketing [5].
Nevertheless, the analysis of text data is challenging and requires the conversion of text into numerical values, which are needed as input data for powerful machine learning algorithms. Over the past years, a wide variety of language models have been developed, ranging from the pure analysis of word frequencies to complex transformer models that are able to process multilingual data and take content as well as context into account. Especially through the concept of transfer learning, which is based on the use of pre-trained models, huge progress in NLP has been achieved. However, since such language models are trained on huge corpora, the training process is extremely time-consuming and computationally intense. The training corpus therefore determines the field of application and the domain in which the model will work well [8].
Since its launch in 2018, Google's Bidirectional Encoder Representations from Transformers (BERT) has been one of the most significant natural language models [9]. BERT-Large, which is based on a transformer architecture, is considered one of the most powerful language models, with 24 layers, 16 attention heads, and 340 million parameters in total [10]. It is a model pre-trained from scratch and can be fine-tuned to perform numerous downstream tasks such as text classification, question answering, sentiment analysis, extractive summarization, named entity recognition, or sentence similarity [8]. BERT-Base was pre-trained in a self-supervised way on a large English corpus consisting of raw texts from the BookCorpus dataset, which includes over 11,000 books, in addition to the entire English Wikipedia. The nature of this training corpus implies that BERT was trained on a generic, domain-unspecific corpus [11]. Yet, for domain-specific applications and downstream tasks, it has been shown that pre-training BERT on a large domain-specific corpus can be useful as it allows linguistic peculiarities to be captured better [12]. For example, several BERT variants have been pre-trained for the financial (FinBERT) [13], medical (Clinical BERT) [14], biological (BioBERT) [15], and computer science (SciBERT) [16] sectors. For tourism-related content, however, a domain-specific adaptation of BERT is not yet available, which is why this paper introduces TourBERT. TourBERT is presented and evaluated in more detail in the following sections.

Methodology and Results
The following sections describe the methodological procedure for the development of the TourBERT language model. The pre-training of TourBERT will be presented first, followed by its model evaluations. For the sake of clarity, the results of the five different evaluations are reported immediately after the description of each evaluation process.

Pre-Training TourBERT
TourBERT uses BERT-Base-Uncased as its underlying architecture and was trained from scratch, unlike BioBERT or FinBERT, which were both further pre-trained from the initial BERT-Base checkpoint. The training corpus was pre-processed by converting the data to lowercase and splitting it into sentences, resulting in 22,601,333 sentences in total. Thereafter, two TourBERT models were trained, one with a SentencePiece tokenizer and one with a WordPiece tokenizer. The motivation to use SentencePiece rather than the conventional WordPiece tokenizer in conjunction with BERT was to allow TourBERT to be extended to a multi-language model in the future, since SentencePiece is able to account for the grammatical peculiarities of complex languages such as Chinese. To obtain a custom vocabulary, SentencePiece (32,000) and WordPiece (30,522) tokenizers were trained, with the latter being equal in size to the BERT-Base tokenizer. Both models were pre-trained for 1M steps on a single Google Colab Pro TPU instance, which took about three days in total.
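The corpus pre-processing described above (lowercasing and sentence splitting) can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say which sentence segmenter was used, so the regex-based splitter and the function name `to_sentences` are hypothetical stand-ins rather than the authors' pipeline.

```python
from __future__ import annotations

import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# Assumption: the paper does not name its sentence splitter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_sentences(document: str) -> list[str]:
    """Lowercase a document and split it into a list of sentences."""
    lowered = document.lower().strip()
    if not lowered:
        return []
    return [s.strip() for s in _SENTENCE_END.split(lowered) if s.strip()]


if __name__ == "__main__":
    # Hypothetical review text, used only for demonstration.
    review = "The hotel was lovely. Breakfast could be better! Would we return? Yes."
    for sentence in to_sentences(review):
        print(sentence)
```

A rule-based splitter like this will mis-handle abbreviations such as "St." or "approx."; for a corpus of millions of reviews, a dedicated sentence segmenter would likely be preferable.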

TourBERT Model Evaluation
The evaluation of TourBERT was performed using both quantitative and qualitative measures. Two sentiment classification tasks were used for the supervised evaluation, while topic modeling, synonyms search, and a within-vocabulary words similarity distribution analysis were applied as part of the unsupervised evaluation. It is important to note that the evaluation of supervised tasks used SentencePiece tokenizers only since both models had comparable performance, as will be shown below.

Supervised Evaluation: Sentiment Classification
For classification purposes, BERT's architecture must be extended with a classifier layer in order to enable predictions. This can be achieved in numerous ways; for example, one of the most widely used approaches is attaching a softmax layer on top of the BERT model. A more advanced way of designing a classifier, however, involves a long short-term memory (LSTM) layer, which is useful for the representation of long sequences exceeding BERT's maximum input length. In the case of TourBERT, outputs were passed through a single feed-forward layer, a simple classifier commonly used for benchmarking different transformer models against each other. Keeping in mind that such an architecture would not yield state-of-the-art results, the aim was simply to demonstrate that TourBERT can surpass BERT-Base, not to achieve superior results on a particular dataset.
The sentiment classification task was performed on two publicly available datasets involving hotel reviews. The first dataset contains 69,308 hotel reviews from Tripadvisor [17] and includes three sentiment classes: {-1: "negative", 0: "neutral", 1: "positive"}. The second dataset contains 515,000 reviews of European hotels [18]. Here, only reviews with either negative or positive labels were used, which, in turn, transformed this problem into a binary classification with the following two classes: {-1: "negative", 1: "positive"}. The dataset contains attributes such as hotel name, number of reviews, and geographical position as well as negative and positive reviews from each reviewer. If a user had left only positive reviews, then the value for the negative reviews was left blank, and vice versa. The following pre-processing approach was thus used to prepare this dataset for a binary classification problem: only reviews from users who left either only negative or only positive reviews were included. Using this approach, 35,000 positive and 35,000 negative reviews were sampled, resulting in 70,000 samples in total.
Both datasets were first pre-processed and then split into training, validation, and testing sets according to an 80%/10%/10% proportion. The pre-processing procedures included lowercasing and the removal of punctuation and non-ASCII characters from the text. Evaluation results for both tasks are shown in Tables 1 and 2 below, while Figure 1 presents the ROC curve and AUC score for TourBERT and BERT-Base models in the second task.
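To make the idea of a single feed-forward classifier on top of a sentence representation concrete, the following dependency-free sketch applies one linear layer followed by a softmax. It is not the authors' implementation: the 4-dimensional toy vector stands in for a 768-dimensional BERT output, the weights are hand-picked rather than learned, and the use of softmax to turn the layer's outputs into class probabilities is an assumption.

```python
from __future__ import annotations

import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled: list[float],
             weights: list[list[float]],
             bias: list[float]) -> list[float]:
    """One linear layer (weights: classes x hidden) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


if __name__ == "__main__":
    # Toy stand-in for a sentence embedding (hypothetical values).
    pooled = [0.2, -0.1, 0.4, 0.0]
    # Hand-picked weights, one row per class; a real head would learn these.
    weights = [[1.0, 0.0, 0.0, 0.0],   # class -1: negative
               [0.0, 1.0, 0.0, 0.0],   # class  0: neutral
               [0.0, 0.0, 1.0, 0.0]]   # class  1: positive
    bias = [0.0, 0.0, 0.0]
    labels = [-1, 0, 1]
    probs = classify(pooled, weights, bias)
    print("predicted label:", labels[probs.index(max(probs))])
```

Because the head is this small, differences in classification quality between two models can be attributed mostly to the underlying representations, which is the point of using it as a benchmark.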
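The selection of one-sided reviews and the balanced sampling described above might look roughly like this. The field names `positive_text` and `negative_text`, the helper names, and the example rows are hypothetical; the actual column names and blank-value encoding of the dataset are not given in the text.

```python
from __future__ import annotations

import random


def one_sided_examples(rows: list[dict]) -> tuple[list[str], list[str]]:
    """Return (positives, negatives) from rows where exactly one field is filled."""
    positives, negatives = [], []
    for row in rows:
        pos = (row.get("positive_text") or "").strip()
        neg = (row.get("negative_text") or "").strip()
        if pos and not neg:
            positives.append(pos)
        elif neg and not pos:
            negatives.append(neg)
        # Rows with both or neither field filled are skipped.
    return positives, negatives


def balanced_sample(positives: list[str], negatives: list[str],
                    per_class: int, seed: int = 0) -> list[tuple[str, int]]:
    """Draw per_class examples from each class and attach labels 1 / -1."""
    rng = random.Random(seed)
    chosen = [(text, 1) for text in rng.sample(positives, per_class)]
    chosen += [(text, -1) for text in rng.sample(negatives, per_class)]
    rng.shuffle(chosen)
    return chosen


if __name__ == "__main__":
    rows = [
        {"positive_text": "Great location", "negative_text": ""},
        {"positive_text": "", "negative_text": "Noisy room"},
        {"positive_text": "Nice staff", "negative_text": "Small bathroom"},
        {"positive_text": "Lovely breakfast", "negative_text": None},
    ]
    pos, neg = one_sided_examples(rows)
    print(balanced_sample(pos, neg, per_class=1))
```

In the study, `per_class` would correspond to 35,000, giving 70,000 labeled samples.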
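The text normalization and the 80%/10%/10% split can be illustrated with the standard library alone. The function names, the shuffling before the split, and the fixed seed are assumptions made for this sketch; the paper does not state whether the split was shuffled or stratified.

```python
from __future__ import annotations

import random
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop non-ASCII characters, and strip ASCII punctuation."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    no_punct = ascii_only.lower().translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())  # collapse repeated whitespace


def split_80_10_10(items: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle a copy of items and cut it into train / validation / test."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


if __name__ == "__main__":
    print(normalize("Café was GREAT!!!"))
    train, val, test = split_80_10_10(list(range(100)))
    print(len(train), len(val), len(test))
```

Note that dropping non-ASCII characters also removes accented letters, so a word like "Café" loses its final character; whether that matters depends on the share of non-English names in the reviews.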

Unsupervised Evaluation: Visualization of Photo Annotations
The first unsupervised evaluation task was the visualization of photo annotations via TensorBoard Projector. For this task, a dataset of 48 photos depicting different tourism activities, such as sports activities, sightseeing, and shopping, amongst others, was used. Next, 622 people were asked to manually label these photos by assigning two bigram tags to each individual photo. These annotations were then visualized using the TensorBoard Projector API, which allows for the visualization of original photos on a 2D or 3D plot located within their respective cluster centers. Finally, a UMAP down-projection was performed, and the groups' separation quality on the plot was inspected and compared. The visualization results for BERT-Base and TourBERT are presented in Figures 2 and 3, respectively.
The purpose of such a visualization is to evaluate the separation of clusters that naturally form from the down-projection method. Overall, one can observe that the TourBERT vectors lead to better group separation and that the pictures within each group contain similar content. In contrast, when observing the results produced with BERT-Base vectors, the content of the pictures appears to be heavily mixed, without any visible cluster separation.

Unsupervised Evaluation: Topic Modeling
A subsequent unsupervised evaluation was undertaken by applying a topic modeling approach. For this, 5000 Instagram posts with the hashtag #wanderlust were crawled from public accounts using the Python Scrapy library. Instagram, as a social platform, relies primarily on photos as its main source of information, while the textual description of Instagram posts is often either limited to hashtags and emojis, unrelated to the photo, or missing entirely. Therefore, images were annotated using the Google Cloud Vision API, and a TourBERT vector was generated for each photo annotation. Photo annotations were analyzed based on their similarity using a K-means clustering approach. The number of clusters was chosen using the silhouette score, which resulted in 25 clusters. In order to enable cluster center visualization on a 2D plot, a PCA down-projection method was selected to transform a 768-dimensional BERT embedding into a two-dimensional map. Figure 4 below shows the cluster centers on a 2D plot, where the size of a cluster center is proportional to the cluster's population size. Such a visualization allows the quality of the topic separation to be evaluated.
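Choosing the number of clusters by silhouette score, as described above, can be sketched without external libraries. This is a simplified illustration and not the study's code: the nine 2D toy points stand in for 768-dimensional annotation vectors, the K-means initialization is a deterministic evenly spaced pick rather than a random or k-means++ seeding, and all function names are hypothetical.

```python
from __future__ import annotations

import math


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with evenly spaced initial centers (a simplification)."""
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign(points, centers)


def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two non-empty clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, candidates):
    """Return the candidate k with the highest mean silhouette score."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))


if __name__ == "__main__":
    # Three well-separated toy groups (hypothetical data).
    points = [(0, 0), (0, 1), (1, 0),
              (10, 0), (10, 1), (11, 0),
              (0, 10), (0, 11), (1, 10)]
    for k in (2, 3, 4):
        print(k, round(silhouette(points, kmeans(points, k)), 3))
    print("chosen k:", choose_k(points, [2, 3, 4]))
```

The pairwise-distance silhouette used here is quadratic in the number of points, which is manageable for 5000 annotations but would need a library implementation or subsampling for much larger sets.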
From Figure 4, one can see that the cluster centers produced with the down-projected TourBERT vectors are better separated than those produced with the BERT-Base vectors.
Another aspect of the topic modeling analysis was the estimation of word similarity within the same cluster. Topic words for both BERT-Base and TourBERT can be seen in Tables 3 and 4.
Although the hashtag #wanderlust may lead one to think of photos that, to some extent or another, contain natural landscapes, the topic model produced with TourBERT vectors was able to identify distinct topics like "underwater world" (topic 1), "beach activities" (topic 2), "food and drink" (topic 7), "vehicle" (topic 11), or "animals" (topic 24). An attempt to find similarly grouped clusters for the BERT-Base model was less successful since nearly every topic includes landscape descriptions. While several distinct topics were indeed found by the model, the majority of them contain mixed concepts, each one including terms describing nature or landscapes.
For better visibility and to gain a better understanding of the quality and distinction of the topics, another visualization for each of the two topic models was produced, as can be seen in Figures 5 and 6. Each figure contains a table, with the first column presenting words for a given topic (see Tables 3 and 4) and all subsequent columns depicting the top 10 most similar samples, i.e., photos for that topic. When inspecting the results from both models, it becomes apparent that the clusters created through TourBERT are much more homogeneous within the clusters themselves and quite heterogeneous across clusters. On the other hand, those generated by BERT-Base occasionally include photos that are relatively dissimilar to each other despite belonging to the same topic, such as in topic 3.

Unsupervised Evaluation: User Study
To further investigate the quality of each topic produced by the abovementioned models and to test the assumptions made thus far, a user study was conducted on the same set of images and annotations to statistically evaluate the results. First, a set of the 10 most similar photos for each of the 25 clusters produced by BERT-Base and TourBERT was created. Thereafter, users were asked to evaluate the similarity of the photos within each of the 50 clusters using a seven-point Likert scale, with possible answers ranging from "very similar" to "very different" (see Figure 7). Similar to measuring intercoder reliability in qualitative studies, this evaluation approach allowed for an intersubjective assessment of the quality of the clusters. Throughout this process, the image clusters were shown to the participants in a rotating, randomized order. To investigate this study's results, a pairwise t-test was performed with SPSS, the results of which are presented in Table 5 below. The coding ranged from 1 (very similar) to 7 (very different), with the mean values being 3.75 and 2.5 for BERT-Base and TourBERT, respectively, at a highly significant level (Sig. two-sided = 0.000). Effect size was measured with Cohen's d, yielding a medium-level effect of 0.517. From the results above, it can be concluded that the photos within TourBERT clusters were perceived as significantly more similar to each other than those within BERT-Base clusters.
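For readers who want to reproduce this kind of comparison outside SPSS, the sketch below computes a paired t statistic and an effect size from two lists of ratings given by the same raters. The ratings are invented for illustration and are not the study's data. The effect size shown is the mean difference divided by the standard deviation of the differences; the paper does not state which Cohen's d variant was reported, so this choice is an assumption.

```python
from __future__ import annotations

import math
import statistics


def paired_t_and_d(first: list[float], second: list[float]) -> tuple[float, float]:
    """Return (t statistic, Cohen's d) for two paired samples."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    cohens_d = mean_diff / sd_diff     # standardized by the SD of differences
    return t_stat, cohens_d


if __name__ == "__main__":
    # Hypothetical ratings (1 = very similar, 7 = very different).
    bert_base = [4, 4, 3, 5, 4]
    tourbert = [2, 3, 2, 3, 3]
    t_stat, d = paired_t_and_d(bert_base, tourbert)
    print(f"t = {t_stat:.3f}, d = {d:.3f}, df = {len(bert_base) - 1}")
```

Obtaining a p-value additionally requires the cumulative distribution function of Student's t with n - 1 degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.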

Unsupervised Evaluation: Synonyms Search
Because BERT-Base was trained on a generic corpus whereas TourBERT was trained on a tourism-specific corpus, it was hypothesized that a similarity search for tourism-related terms would lead to better results with TourBERT than with BERT-Base. Therefore, with the help of a tourism-domain expert, words that carry multiple meanings in general as well as in tourism-specific contexts were selected. For example, the word "transfer" has multiple meanings and is usually associated with "transformation", "transplantation", and so on; however, from a tourist's perspective, associations such as "taxi", "pick up", or "hotel transfer" might come to mind. The top eight most similar words for each term can be seen in Tables 6 and 7 for both BERT-Base and TourBERT.

Top eight most similar words per query term produced by TourBERT (one column per query term; header row not preserved):
uniqueness   experince    entry       destination  tickets    spot        ##guide      transfers       exploring    sevice
ambience     expereince   enterance   feature      entry      attraction  guides       transport       sights       services
originality  experiance   admittance  landmark     entrance   place       tourguide    pickup          attractions  staff
intimacy     adventure    admission   place        wristband  point       guid         transportation  exploration  personnel
charm        experiences  ticket      institution  admission  itinerary   driver       journey         nightlife    hospitality
accuracy     enjoyment    fee         museum       fee        hotspot     interpreter  limousine       hiking       personel
flare        opportunity  carpark     spot         pass       venture     guiding      shuttle         outings      frontdesk
warmth       expere       payment     site         tix        hangout     narrator     pickups         excursions   housekeeping

From a technical perspective, the native implementation of BERT does not allow for the querying of most similar words since, unlike Word2Vec or FastText models, BERT does not contain static vectors but, rather, produces them dynamically. As a result, it can output two completely different vectors for the same word based on the context it was mentioned in. As the intention is still to compare words as standalone context-independent units, an algorithm that enables any BERT-like model to query its vocabulary in order to find the most similar words was constructed. The algorithm works as follows: First, pairwise similarities between all the words in BERT's vocabulary were computed, resulting in a 30,522 × 30,522 matrix. Then, using the KDTree algorithm from Python's Sklearn library, a search index was built on that matrix, which allows for fast querying.
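The idea of querying a vocabulary for the most similar words can be illustrated with a small, dependency-free sketch. Note the substitution: instead of building a KDTree index over the full similarity matrix as the paper describes, this version scans the vocabulary by brute force and ranks words by cosine similarity, which is sufficient for a toy vocabulary. The 2-dimensional vectors and the function names are hypothetical; real vectors would come from the model.

```python
from __future__ import annotations

import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word: str, embeddings: dict, top_k: int = 8) -> list[str]:
    """Rank all other vocabulary words by cosine similarity to `word`."""
    query = embeddings[word]
    scored = [(cosine(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical static vectors, one per vocabulary word.
    toy_vocab = {
        "ticket": (1.0, 0.1),
        "entrance": (0.9, 0.2),
        "wristband": (0.8, 0.3),
        "glacier": (0.0, 1.0),
    }
    print(most_similar("ticket", toy_vocab, top_k=2))
```

For a 30,522-word vocabulary, a brute-force scan per query is still feasible, but a prebuilt index such as the KDTree mentioned above amortizes the cost when many queries are issued.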
When comparing synonyms produced by BERT-Base and TourBERT, one can see that TourBERT captures the tourism-specific meaning of a given word almost perfectly. In contrast, BERT-Base captures a more generic meaning of the same word. For example, TourBERT associates the word "ticket" with "entrance" and "wristband", whereas BERT-Base considers the same word in the scope of public transport, presenting words like "trains", "bus", and "metro". To provide another example, the BERT-Base model associates the word "destination" with words such as "dying", "choice", "lame", and "address", whereas TourBERT outputs "spot", "attraction", "place", and other words that are closely related to "destination" in a tourism context.

Conclusions
In tourism research as well as in the tourism industry, the automatic analysis of texts is becoming increasingly important. Language models are needed to perform a variety of downstream tasks such as topic modeling, text classification, entity recognition, sentiment analysis, or information extraction. However, it has been shown that the quality of the domain-specific use of pre-trained models depends significantly on the training corpus itself. While optimized language models have already been developed for business and scientific domains, such as the financial [13], medical [14], or biological [15] sectors, this has yet to be the case for tourism. Therefore, the aim of this study was to optimize the most important and widely used language model to date, BERT, for tourism-specific applications. Five different evaluation tasks demonstrated the applicability and performance of TourBERT in tourism contexts. TourBERT outperformed BERT-Base in all domain-specific tasks and thus represents a suitable language model for academia and the tourism industry. This study further contributes to the discussion of the importance of domain-specific language models from a theoretical perspective, while, from a methodological point of view, it provides detailed insights into the development and training of TourBERT. As a result, this study can also be seen as a guide on how to train and evaluate BERT models for other domains. The practical contribution lies in making TourBERT available to the open-source community: the model is hosted on the Hugging Face Model Hub and accessible via https://huggingface.co/veroman/TourBERT (accessed on 23 May 2022). TourBERT is thus freely accessible and ready to use for tourism-specific NLP tasks.
Although an attempt was made to ensure that the training corpus was as multi-layered as possible and that the intercultural dimension, a very important aspect for tourism, was taken into account, an even larger training corpus would most likely lead to increased performance rates. In particular, the inclusion of scientific texts would be useful at this point in order to better analyze texts, such as scientific books and papers, in the context of tourism.
Author Contributions: Conceptualization, V.A. and R.E.; methodology, V.A. and R.E.; evaluation, V.A. and R.E.; writing, V.A. and R.E. All authors have contributed equally. All authors have read and agreed to the published version of the manuscript.
Funding: This project was carried out without funding.

Data Availability Statement: We publicly release the TourBERT model, which is available on the Hugging Face Model Hub and accessible through https://huggingface.co/veroman/TourBERT (accessed on 23 May 2022).
Conflicts of Interest: The authors declare no conflict of interest.