Exploring User-Generated Content for Improving Destination Knowledge: The Case of Two World Heritage Cities

Abstract: This study explores two World Heritage Sites (WHS) as tourism destinations by applying several uncommon techniques in these settings: Smart Tourism Analytics, namely Text Mining, Sentiment Analysis, and Market Basket Analysis, to highlight patterns according to attraction, nationality, and repeated visits. Salamanca (Spain) and Coimbra (Portugal) are analyzed and compared based on 8,638 online travel reviews (OTR), from TripAdvisor (2017–2018). Findings show that WHS reputation does not seem to be relevant to visitors-reviewers. Additionally, keyword extraction reveals that the reviews do not differ from language to language or from city to city, and it was also possible to identify several keywords related to history and heritage; in particular, architectural styles, names of kings, and places. The study identifies topics that could be used by destination management organizations to promote these cities, highlights the advantages of applying a data science approach, and confirms the rich information value of OTRs as a tool to (re)position the destination according to smart tourism design tenets.


Introduction
From the beginning of the Web 2.0 platforms, we have witnessed an increase in travelers' and visitors' active participation when reviewing their travels or visits on social media. Studies point to the significant impacts on companies, organizations, and stakeholders of heritage and cultural tourism of the prevalence of participatory, interactive, and dynamic internet, centered on the consumer [1]. From the perspective of demand, these technologies allow tourists to access more information and more knowledge and control many aspects of their trip [2,3]. Consequently, platforms have radically transformed the way tourists perceive and engage with heritage [4]. One of the fundamental aspects of this involvement is the fact that tourists use new technologies to constantly share their experiences and impressions of the places they visit, thus competing with promotional materials and strongly influencing the decision-making process [5][6][7].
When we juxtapose cultural and heritage tourism and Web 2.0, there is still a lot of theoretically robust research to be developed [7]. In fact, there is an information gap between what visitors want and what they find during their trips. Additionally, from the perspective of management and marketing, it is necessary to understand who the UNESCO World Heritage tourists are in order to offer positive experiences without downplaying the much-needed support for place conservation [8]. Thus, one of the fundamental questions when carrying out tourist research on historical urban spaces is understanding what motivates individuals and families to visit heritage cities. This study attempts to answer this question through tourists' own online travel reviews, research that, to date, does not seem to have been conducted, either in Portugal or elsewhere. On the other hand, comparative studies, which go beyond circumscribed local contexts, are essential to confront and unveil the similarities of what is often "thought to be historical-cultural particularisms and the uniqueness of local heritage" [9].
The general objective of this study is to emphasize the value of comprehensively exploring and understanding the main concerns and opinions that visitors' reviews portray about the sites they visit, based on the collection and exploitation of data available on TripAdvisor, so that those responsible for heritage management can arrive at informed decisions. As such, this study explores two World Heritage Sites (WHS) as tourism destinations, Salamanca in Spain and Coimbra in Portugal, by applying Data Science techniques such as Text Mining, Sentiment Analysis, and Market Basket Analysis to highlight patterns according to attraction, nationality, and repeated visits.
As a specific objective, we present a mixed exploratory methodology for the comparative analysis of an OTR (Online Travel Review) corpus of 8638 reviews in three languages (English, Spanish, and Portuguese), published in 2017 and 2018 by visitors to two UNESCO heritage cities: Coimbra, Portugal, and Salamanca, Spain. The two cities were chosen because they have common characteristics, as described below, and these three languages were chosen because they correspond to the languages of the relevant tourist markets in the two countries [10,11].
Our findings will provide yet another instrument for those responsible for heritage management to define objectives and strategies for the sites' sustainable promotion. Our research is based on the collection and exploitation of data available on TripAdvisor. It should be noted that in 2009, UNESCO and TripAdvisor launched a partnership to mobilize the support of visitors to preserve natural and cultural sites inscribed on the UNESCO World Heritage List [12].
From the perspective of research and analysis techniques and methods, there has been a growing interest in mixed methodological approaches, which encourage interdisciplinarity in studies in the area of Tourism and Hospitality [13,14]. However, concerning research with large amounts of textual data collected from social media and other forms of computerized media coverage, there seems to be an overall tendency for quantitative methods, which implies less informational wealth and a less holistic understanding of the phenomenon under analysis [15]. There are interesting exceptions, such as Rodrigues, Brochado, and Troilo [16], who focus on tourists' wellbeing by applying sentiment analysis to post-experience booking.com reviews. Thus, the specific aim of this study is to present how a mixed exploratory methodology for the analysis of an OTR corpus can uncover useful insights that Destination Management Organizations (DMO) of UNESCO heritage sites can use to promote the sites or to improve the visitors' experience. Reviews published in 2017 and 2018 about two WHS, Coimbra (Portugal) and Salamanca (Spain), were used to assess our methodological approach. The authors selected these cities due to their proximity, familiarity, native languages, markets' importance and place of residence.
The following research questions guided this study: Q1: Is the UNESCO heritage classification a factor of attraction? Q2: What are the perceptions of visitors? Q3: How can Salamanca and Coimbra take advantage of the visitors' perceptions?
The two cities have common characteristics: both have been university cities since the foundation of their universities in the 13th century, and both are prominent political and religious centers in each country's history. Their tourist attractiveness depends mostly on these characteristics [17]. Also, both cities are classified as UNESCO heritage cities: Salamanca since 1988 and Coimbra since 2013.
Given that Salamanca and Coimbra are only 304 km apart (a car journey of just over three hours) and share many similarities, the two cities can be considered direct competitors as tourism destinations [17]. Thus, choosing these two cities seems to be appropriate to develop a comparative analysis, especially since some studies point to tourist flows between the two cities [17]. Not only do tourists visit these two cities, but they also visit multiple destinations in Europe in one trip. This is not a new phenomenon (the Grand Tour and the Inter-rail are examples of this travel concept), and it is increasingly frequent and accessible to significant segments of the population, not only from Europe, but also from Asia and America. Other changes in travel patterns include a higher rate of individual or independently planned trips (as opposed to group planned), immersion in a specific region, and the growth of thematic trips [18]. These trends are particularly facilitated by online travel planning tools and naturally by OTRs.
OTRs are undoubtedly crucial in the decision-making process when choosing a destination. Consumers, more than ever, rely on the internet to make choices. For tourist consumption, this medium is essential in purchasing, selecting destinations, and choosing what to visit at the destination itself [3]. Indeed, recent studies in marketing have been dedicated to exploring the influence of OTRs on future travelers, from the perspective of those who will travel and how they interpret and assimilate the review contents [19]. Less frequent are studies focused on the reviews themselves or on reviews assessing visited places [20]. From the tourist product and evaluation perspective, most analyses have focused on OTRs related to hotels and published on Booking.com or TripAdvisor.
This paper is structured as follows: the next section summarizes the advantages of being included in the UNESCO List and briefly describes Salamanca and Coimbra as tourism destinations. It is followed by a review of Web 2.0, eWOM (Electronic Word of Mouth), the TripAdvisor platform, and its importance for tourists and contemporary travelers. The following section presents the methodology, and the various methods proposed are illustrated with some results. This is then followed by two sections on the findings and conclusions.

UNESCO World Heritage List, Coimbra and Salamanca
Several studies show that the UNESCO listed heritage causes increased tourist flow and often brings more support from national and local governments for preservation or maintenance [21]. However, this support may vary according to different national contexts. In many cases, it may be a blessing in disguise [22]. One of UNESCO's objectives is to implement unique visiting experiences while simultaneously creating social and economic benefits for the places themselves [2,23]. Nonetheless, other impacts exist, such as changes in access to places or different uses of those places. Thus, it becomes necessary to systematically assess UNESCO heritage's effects to manage the sites, namely the construction of appropriate indicators, since UNESCO WHS's applications, especially since the 1990s, had as the primary objective the increase in visitor and tourism revenue [21,22]. However, studies that quantitatively measure economic impacts appear to be inconclusive [21,24].
Sustainability 2020, 12, 9654
In contrast, other studies highlight the limited role of UNESCO listing and argue that the impact of being listed as world heritage depends on other contextual issues. For example, Prud'homme [25] concluded that there is an exaggerated perception of the impacts on local development. Results from the report by Research Consulting Ltd. & Trends Business Research Ltd. [26] confirm Prud'homme's findings, stating that 70% to 80% of the places appear to be doing little or nothing with the UNESCO listing to achieve significant economic impacts. For instance, Coimbra witnessed an immediate increase in visits to the University, but local businesspeople are not unanimous about the earned benefits [27]. Therefore, it would be relevant to explore if the "comparative advantage" based on Coimbra's and Salamanca's "outstanding cultural endowment included in the World Heritage List can transform their initial advantage into a competitive advantage" [28]. According to Poria, Butler, and Airey [29], the two most common reasons reported in the literature to visit a heritage site are education (i.e., the tourists' willingness to learn) and entertainment (i.e., the tourists' desire to be entertained). However, Poria et al. identified other reasons for visiting heritage sites, such as 'heritage experience', 'learning experience', and 'recreational experience', which are linked to the tourists' perception of the site in relation to their own heritage and their willingness to be exposed to an emotional experience. In a different study, Poria, Reichel, and Biran [30] identified five main motives for visiting a heritage site: learning, connecting with my heritage, leisure pursuit, bequeathing for children, and emotional involvement. This study suggested that heritage sites' managers and marketers should provide different tourists with different experiences to enhance the emotional involvement that visitors may feel at the visited site.
Also, as tourist perceptions of a site may be associated with identifiable visitor characteristics (for instance, religion or nationality), this could help management identify those who perceive the site to be part of their own heritage.
The prestige associated with being listed, financial support, and access to specialized knowledge for preservation and conservation are essential benefits of the UNESCO WHS' brand. In addition, prestige acts as a catalyst for preserving heritage by governments and citizens alike [31]. The UNESCO World Heritage 'brand' value appears to be quite relevant in certain countries and communities. For example, in countries such as Portugal, the application and subsequent selection process is closely monitored by the national media and is of significant importance for enhancing cultural and national identity.
Salamanca is a medium-sized Spanish city in the interior of the Iberian Peninsula, with 144,949 inhabitants [35]. In 1988, Salamanca was listed as UNESCO World Heritage and, in 2002, was branded as the European City of Culture. These events strongly contributed to the development of tourism, which was supported by the unique value of the city's architecture, with Roman, Medieval, Renaissance, and Baroque heritage, among which the cathedral, the Plaza Mayor, and the University stand out [36]. In 2017, there was a total of 1,103,176 overnight stays [37].

User Generated Content (UGC) and Electronic Word of Mouth (eWOM)
Web 2.0 refers to the transition from the static HTML pages that emerged with Web 1.0 to dynamic and interactive web pages. Web 2.0, also called the Social Web or Social Era, consists of a participatory, interactive, and dynamic internet, focused on the consumer, whose contents are generated by the users (User Generated Content or UGC); that is, the users of the information are simultaneously its consumers and producers [38]. In this role of prosumers, i.e., active consumers and producers of information, consumer participation results in UGC and opens new dimensions of communication [39]. According to [3], UGC is considered more reliable than official sources because it is perceived as genuine and not focused on business.
OTRs are forms of UGC that reflect evaluations and reviews of the visitor's own experience and of the destination. UGC can be equated with eWOM (Electronic Word of Mouth) and is crucial in shaping destinations' image and reputation [3]. When a destination exceeds visitors' expectations, they are likely to be motivated to share their positive experiences with others. On the other hand, disappointed visitors can use negative eWOM to spread their negative emotions. For this reason, tourism managers and stakeholders should take online reputation into consideration to understand what visitors write about products, services, and brand traits. As Buhalis and Inversini ([39], p. 25) state, "online reputation should be understood as an asset for companies and organizations. Tourism managers should start to exploit all the information available online to boost tourists' experiences". Indeed, gathering and analyzing information from the tourist's perspective is imperative to understand the tourist's experiences and a crucial source for designing new experiences and improving existing ones [40].
Interpersonal influence and eWOM are classified as the most relevant information sources when consumers are deciding on a purchase. This influence can be especially significant in the tourism and hospitality sector, as its intangible products are difficult to assess before consumption [41]. Collaborative and interactive platforms, e.g., 2.0 platforms, offer an accessible medium where people can express their opinions and suggestions on any subject and where they can become co-creators, thus creating a space for co-creation between entities and audiences [42]. Viral marketing is especially persuasive in the tourism and hotel industry, as users of 2.0 platforms are more willing to accept and trust information from people who are similar to them. Thus, viral marketing allows OTRs to reach a broader audience [19].
The widespread use of 2.0 platforms is currently causing radical changes in promoting tourist destinations through the clear strategy of incorporating UGC. Consequently, OTRs represent a precious source of information for both travel agencies and DMOs. By analyzing reviews and evaluations by visitors, DMOs adjust and adapt the promotion of tourist destinations to the new challenges posed by this new paradigm of interaction and communication [43].
UNESCO world heritage destinations are no different from other destinations in that they rely on eWOM to promote themselves. For example, Bergel and Brock [44] recognize that engaged customers can help promote heritage sites and influence customers' pay perception. Also, Mehmood et al. [45] acknowledge that good sustainable practices in heritage sites improve eWOM, positively influencing the site's image and tourists' intention to visit the site. A further example, from Luli and Kawano [46], shows that applying data mining/analytics techniques to TripAdvisor data helps uncover the psychological impact of the technological intervention on the memorial heritage's survival.
With Web 2.0, tourists or visitors present increasingly documented OTRs, with an online story-telling process that seems to help them improve their experiences. An illustrative case is the study developed by Rahmani, Gnoth, and Mather [47], which uses Corpus Linguistics (in ICT, this technique is called Natural Language Processing or Text Mining, see below) to extract the "hidden" meanings in OTRs and identify their characteristics, issues explored below.
Introduced in 2000, TripAdvisor is one of the most used 2.0 platforms today, operating as an online travel guide that offers user-generated reviews on travel-related content. In 2018, it hosted approximately 730 million user reviews, covering more than eight million listings for restaurants, hotels, vacation rentals, and attractions. It has 490 million unique visitors and sends approximately 80 million emails per week [48]. This platform allows users to search for hotels, activities, restaurants, flights, cruises, and cars for rent, among other categories. Among the criteria applicable to would-be reviewers, TripAdvisor advises that reviews offer reliable advice from real travelers [49]. This advice rests on the requirement that users' content be based on "firsthand" experiences and provide a substantial contribution to the issue at hand.

Mixed Methods Approach
The use of a plurality of methods from different disciplinary areas (ICT, Tourism, and Linguistics) adapts well to research topics in the social sciences that are subjective, complex, and multidimensional. These methods allow access to the OTR's textual richness to obtain different but complementary results [50], as is the present study's case. The complementarity of the methods used also offers a more practical approach and confirms the reliability of the results [14]. Since the perspective of both the quantitative and qualitative components of reviews provides a more comprehensive understanding of people's opinions [51], in the following sections, we present the various applied techniques employed to collect and analyze both components and the respective findings.
The use of a mixed methods approach from Data Science, especially Text Mining, Data Mining, and Machine Learning, also makes less biased and more consistent analysis possible, uncovering unknown patterns and trends. While the algorithms employ the same criteria in all analyzed texts, a human analyst can hardly maintain the same standards and objectives over time. This human fragility is even more noticeable when the volume of texts requires multiple analysts.

Data Extraction and Description
To establish comparisons, we decided to select in Coimbra and Salamanca, respectively, the ten most recommended places for tourists by tourist information sites or digital newspapers with sections dedicated to tourism. For data description and analysis, the terms visitor, tourist, or user are used interchangeably.
The data set was extracted from TripAdvisor in April 2019 through a web robot, or simply bot, an application that performs tasks automatically [52]. With this bot, purpose-built in the C# programming language, 8,638 OTRs published in 2017 and 2018 were extracted. The frequency and distribution of the OTRs extracted by city and language can be seen in Table 1. The data set, stored in an Excel file, was organized according to the following labels:
City: name of the city (Coimbra or Salamanca);
FullText: text written by the user;
GlobalRating: quantitative rating for each location at the date of extraction, on a scale from 1 to 5, with 3 as the average value and 5 being excellent [53];
Language: language in which the review was written;
Location: user's place of residence (registered on TripAdvisor);
Name: user name (to post content and reviews, users have to create a profile on TripAdvisor, which must be associated with a name or pseudonym);
PublishDate: date on which the review was published;
ReviewRating: quantitative rating attributed by the user;
SiteDesignation: name of the site.

Data Analysis
Although OTRs have two evaluation components, the quantitative ratings and the qualitative text written by the user [54], most research on OTRs focuses solely on the quantitative component, even though the qualitative component can provide a richer overview of OTRs [55]. For this reason, to analyze both components of the data, it was decided to use a diverse set of methods, such as Data Mining, Association Rules, Natural Language Processing, Text Mining, and Textual Analysis. This last analytical procedure is subsidiary to the others and supported the interpretation of the data.
Through database manipulation, machine learning, statistics, and data visualization, among other methods from data science and information technology, Data Mining makes it possible to find patterns in structured data and extract information, transforming information into knowledge [56,57]. Other methods such as Natural Language Processing, Text Mining, and Textual Analysis enable information and knowledge extraction from unstructured data, as is the case with the qualitative component of OTRs. On the one hand, Text Mining comprises a set of techniques to characterize and transform the text using the words themselves as units of analysis (for example, frequencies, distribution, or presence/absence of specific terms). On the other hand, Natural Language Processing algorithms use syntactic and/or semantic processing based on statistical rules or methods to analyze, segment, and extract information [58][59][60]. The data analysis consisted of a set of sequential procedures leading to the extraction of information. This process, represented in Figure 1, involved applying several R packages [61].
The first procedure was creating a category for each review with the respective user's country of residence (procedure 1 in Figure 1). Although there is a field for Location, this is an open text box, which is not always filled in by users. Some just write the name of the city. Others complement the city name with the state (common among users from the USA and Canada). An added complexity is that the spelling of a country's name changes with the language used (in our case, English, Portuguese, or Spanish), e.g., Spain, España, or Espanha. Therefore, to identify the country of residence, this first procedure involved applying Natural Language Processing techniques to divide the Location field text into separate words (tokenization). Then, the user's country was identified using another data set, the ISO 3166 list [62], with country designations in English, Portuguese, and Spanish. This identification was based on the country's designation and, when this was not possible, on a comparison against a manually created list of cities and states. This procedure allowed the country to be identified in 5,675 reviews (approximately 66% of the total).
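The country-identification procedure can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the two lookup tables below are small hypothetical stand-ins for the multilingual ISO 3166 designations and for the manually created list of cities and states, and accent stripping is an assumed normalization step.

```python
import re
import unicodedata

# Hypothetical excerpt of country designations in English, Portuguese, and
# Spanish, mapped to ISO 3166-1 alpha-2 codes.
COUNTRY_NAMES = {
    "spain": "ES", "espana": "ES", "espanha": "ES",
    "portugal": "PT",
    "united kingdom": "GB", "reino unido": "GB",
    "united states": "US", "estados unidos": "US", "usa": "US",
}

# Hypothetical excerpt of the manually built fallback list of cities and states.
PLACE_TO_COUNTRY = {
    "madrid": "ES", "lisboa": "PT", "lisbon": "PT",
    "london": "GB", "texas": "US", "california": "US",
}


def normalize(text):
    """Lowercase and strip accents so that 'España' matches 'espana'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split a free-text Location value into word tokens."""
    return re.findall(r"[a-z]+", normalize(text))


def identify_country(location):
    """Return an ISO code for a Location value, or None if unidentified."""
    if not location:
        return None
    tokens = tokenize(location)
    # Candidate phrases: single tokens and adjacent token pairs ("reino unido").
    phrases = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # First try country designations, then fall back to cities and states.
    for table in (COUNTRY_NAMES, PLACE_TO_COUNTRY):
        for phrase in phrases:
            if phrase in table:
                return table[phrase]
    return None
```

For instance, under these assumed tables, `identify_country("Madrid, España")` resolves through the country designation, whereas `identify_country("Austin, Texas")` resolves only through the fallback list of states.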
The following procedure (procedure 2 in Figure 1) involved a set of text annotation tasks, such as counting word frequency (known as TF, Term Frequency) and calculating co-occurrences, to explore the qualitative (textual) component of the reviews. With the application of the R package "udpipe" [63], it was possible to perform four common tasks in Natural Language Processing [59]: (1) tokenization: dividing the text into units (words and punctuation); (2) part-of-speech tagging: identifying the grammatical form of words based on the definition of the word and the context; (3) lemmatization: reducing the word to its lemma (i.e., the canonical form or the most common form of the word); (4) dependency analysis: analyzing the grammatical structure of the sentences, identifying the "main" words and the relationships between "main" words and the other words.
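As an illustration of the two counting tasks, the following Python sketch computes term frequencies and sentence-level co-occurrences. It assumes that tokenization and lemmatization have already been performed by an annotation tool; the example sentences are hypothetical, and counting each pair once per sentence is one possible convention rather than necessarily the one used in the study.

```python
from collections import Counter
from itertools import combinations


def term_frequency(sentences):
    """Count how often each lemma appears across all sentences."""
    return Counter(lemma for sentence in sentences for lemma in sentence)


def cooccurrences(sentences):
    """Count unordered lemma pairs that appear in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        # Use the set of lemmas so a pair is counted once per sentence.
        for a, b in combinations(sorted(set(sentence)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical, already lemmatized sentences (content words only).
sentences = [
    ["beautiful", "university", "library"],
    ["old", "university", "library"],
    ["beautiful", "cathedral"],
]
tf = term_frequency(sentences)
pairs = cooccurrences(sentences)
```

Pairs are stored in alphabetical order, so the co-occurrence of "university" and "library" is read as `pairs[("library", "university")]`.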
The third procedure (number 3 in Figure 1) involved applying sentiment analysis to the texts. Sentiment analysis (also called opinion analysis or opinion mining) is the computational study of opinions about entities, individuals, events, topics, and attributes. Opinion mining makes it possible to quantify opinions according to their polarity (positive, negative, or neutral) [64]. This access to users' opinions on hotels, restaurants, and tourist attractions is considered essential for developing customer service strategies, hence its increasing use in tourism management [65]. Sentiment analysis was performed with the R extension package "SentimentAnalysis" [66]. For the English reviews, the package's base dictionary was used; for Spanish, the ElhPolar dictionary [67]; and for Portuguese, the SentiLex-PT 02 dictionary [68]. The analysis was performed sentence by sentence (taking advantage of the text annotation from the previous procedure), applying the "ruleSentimentPolarity" method, which assigns each sentence a value between −1 (very negative) and 1 (very positive). The score for each review was calculated as the average of the values of its sentences.
The fourth procedure (number 4 in Figure 1) involved applying another Natural Language Processing technique, keyword extraction, where a keyword is a sequence of one or more words that provides a compact representation of a document's content [69]. The implementation of the RAKE algorithm available in the R package "udpipe" [63] was used to extract the keywords. In simple terms, this algorithm selects keywords in three steps: (1) selection of candidate keywords, i.e., sequences of contiguous words delimited by "stopwords" (words that carry little meaning on their own); (2) construction of a co-occurrence matrix of words; (3) calculation of a score for each candidate according to co-occurrence.
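The sentence-level scoring and per-review averaging described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the R package's code: the two word sets are a hypothetical miniature dictionary, sentences are split naively on punctuation rather than taken from the annotation step, and polarity is computed with the common dictionary rule (positive − negative) / (positive + negative).

```python
import re

# Hypothetical miniature polarity dictionary; the study used full lexicons.
POSITIVE = {"beautiful", "amazing", "great", "impressive"}
NEGATIVE = {"crowded", "expensive", "dirty", "boring"}


def sentence_polarity(sentence):
    """Polarity in [-1, 1]: (positive - negative) / (positive + negative)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos + neg == 0:
        return 0.0  # no opinion words: treated as neutral (an assumption)
    return (pos - neg) / (pos + neg)


def review_score(review):
    """Average the polarity of the sentences that make up one review."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    if not sentences:
        return 0.0
    return sum(sentence_polarity(s) for s in sentences) / len(sentences)
```

Averaging over sentences means that a review with one enthusiastic sentence and one critical sentence ends up near neutral, which is the behavior implied by scoring each review from its sentences.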
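The three steps can be made concrete with a compact Python sketch of the classic RAKE scheme, in which each word is scored by its degree divided by its frequency and each candidate by the sum of its word scores. The stopword list is a hypothetical toy list, and the implementation in "udpipe" may differ in details such as how candidate words are selected.

```python
import re
from collections import defaultdict

# Hypothetical, very small stopword list for illustration only.
STOPWORDS = {"the", "of", "and", "is", "a", "in", "with", "to", "its"}


def candidate_phrases(text):
    """Step 1: split text into runs of contiguous non-stopword words."""
    phrases = []
    # Punctuation also ends a candidate phrase.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Steps 2 and 3: score words by degree/frequency, then sum per phrase."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree = co-occurrences within the phrase, the word itself included.
            degree[word] += len(phrase)
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because the degree of a word grows with the length of the phrases it appears in, multi-word candidates such as a monument's full name tend to outrank frequent single words.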
After these four procedures, the analysis proceeded, using statistics, data visualization, network analysis, market basket analysis, text mining, and textual analysis. The results are presented in the following section.

Data Mining
The 8638 reviews were written by 4695 users, which gives an average of 1.7 reviews per user, i.e., some single users posted more than one review. Even though the median was one review per user, the 3rd quartile consists of two reviews per user, with eleven users posting ten or more reviews. There was even one user who published fourteen reviews during the timespan and for the selected places under analysis. This user (a man from Spain), registered as someone between 50 and 64 years old and who had visited 100 cities so far, always wrote in Spanish, and had, at the time of data extraction, a total of 471 reviews published on TripAdvisor. Of the fourteen reviews he published on Coimbra and Salamanca, three reviews were published on Salamanca sites on 2/03/2018, 8/03/2018, and 27/07/2018. Finally, he published nine reviews about places in Coimbra on two consecutive days: 20 and 21/08/2018. Data analysis by user and city showed that 3203 users published reviews from Salamanca, while 1863 users published reviews from Coimbra, revealing that 101 users posted reviews from the two cities.
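The descriptive figures in this paragraph follow from simple counting over (user, city) pairs. The Python sketch below shows the computation on a small hypothetical sample; the user names and values are invented for illustration and are not taken from the study's data.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (user, city) pairs, one per review.
reviews = [
    ("ana", "Coimbra"), ("ana", "Salamanca"), ("ana", "Salamanca"),
    ("ben", "Salamanca"),
    ("carla", "Coimbra"),
    ("dan", "Salamanca"), ("dan", "Salamanca"),
]

# Number of reviews posted by each user.
per_user = Counter(user for user, _ in reviews)
mean_reviews = mean(per_user.values())
median_reviews = median(per_user.values())

# Sets of distinct users per city; the intersection reviewed both cities.
salamanca = {user for user, city in reviews if city == "Salamanca"}
coimbra = {user for user, city in reviews if city == "Coimbra"}
both = salamanca & coimbra

# Inclusion-exclusion: distinct users = |S| + |C| - |S and C|.
assert len(per_user) == len(salamanca) + len(coimbra) - len(both)
```

The final check expresses the relation between the per-city user counts, the users who reviewed both cities, and the total number of distinct users.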
Additionally, the network analysis of users who posted reviews about more than one attraction, as displayed in the arc plot of Figure 2, clearly depicts visitors' preferences about which attractions to visit. In Figure 2, the size of the nodes represents the total number of reviews for the attraction. The arc width represents the number of visitors who posted reviews in both connecting attractions. We also employed the Apriori algorithm to discover association rules between visitors' reviews. The Apriori algorithm is typically employed in the retail industry to apply a "Market Basket Analysis", i.e., to uncover associations between items. The algorithm analyzes the items purchased together (itemset) in each transaction [70]. Also, the algorithm identifies rules, that is, associations of items that are bought together. For example, the rule "Product A + Product B => Product C" can be read as "Product A and product B are commonly purchased with product C". "Product A + Product B" is named the "Left-hand side" (LHS) of the rule, and "Product C" is called "Right-hand side" (RHS).
In this context of attraction visits, we considered each visitor's set of reviews as one transaction and applied the algorithm to identify the most common associations between reviewed attractions. Although the uncovered associations only reflect the patterns of visitors who posted multiple reviews on TripAdvisor and cannot be generalized to all visitors, they reveal patterns in visitors' paths through the cities.
The rules uncovered by the algorithm are measured with three metrics:
Support: the proportion of all transactions with both LHS and RHS (e.g., the percentage of visitors who posted reviews about Catedral Vieja and Plaza Mayor). The formula is:
$$\mathrm{Support}(LHS \Rightarrow RHS) = P(LHS \cap RHS) \quad (1)$$
Confidence: the proportion of transactions containing LHS that also contain RHS.
Lift: the factor by which the co-occurrence of LHS and RHS exceeds the expected probability of both co-occurring. Therefore, the higher the lift, the higher the probability of LHS and RHS occurring together. A lift value less than 1 means that LHS and RHS are probably mutually exclusive or even substitutes. The formula is:
$$\mathrm{Lift}(LHS \Rightarrow RHS) = \frac{P(LHS \cap RHS)}{P(LHS)\,P(RHS)} \quad (2)$$
Network analysis and association rules allow a quantitative analysis of the attractions where visitors stop over. Table 2 shows the top five association rules for each city, in which some prevalent association rules can be observed. For example, for Salamanca, the two top rules account for more than 10% of the visitors, showing that more than 10% of them reviewed both Plaza Mayor and the Catedral Vieja.
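The rule metrics described above can be illustrated with a short Python sketch. It is only a worked example under stated assumptions: the transactions are invented, the function names are hypothetical, and the code evaluates a single given rule rather than running the full Apriori candidate search used in the study. Confidence is computed in its standard form, P(LHS ∩ RHS) / P(LHS).

```python
# Hypothetical transactions: the set of attractions each visitor reviewed.
# These are illustrative values, NOT the study's data.
transactions = [
    {"Plaza Mayor", "Catedral Vieja"},
    {"Plaza Mayor", "Catedral Vieja", "Casco Histórico"},
    {"Plaza Mayor"},
    {"Casco Histórico"},
    {"Plaza Mayor", "Casco Histórico"},
]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(lhs, rhs, transactions):
    """Support, confidence and lift for the rule LHS => RHS."""
    s_both = support(set(lhs) | set(rhs), transactions)
    s_lhs = support(lhs, transactions)
    s_rhs = support(rhs, transactions)
    if s_lhs == 0 or s_rhs == 0:
        raise ValueError("LHS and RHS must each occur in at least one transaction")
    return {
        "support": s_both,                  # P(LHS and RHS)
        "confidence": s_both / s_lhs,       # P(RHS | LHS)
        "lift": s_both / (s_lhs * s_rhs),   # observed vs. expected co-occurrence
    }


metrics = rule_metrics({"Plaza Mayor"}, {"Catedral Vieja"}, transactions)
print(metrics)
```

In this toy data the lift is above 1, meaning the two attractions are reviewed together more often than independence would predict; a value below 1 would point the other way.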
As expected, when taking into account the users' country of residence (see above), the vast majority of reviews are made by users from Portugal and Spain (Figure 3 shows all countries with more than 20 reviews). However, Table 3 also demonstrates that the distribution differs between the two cities. In Coimbra, the number of reviews from foreigners is higher than the number of reviews from Portuguese visitors, and there are even more reviews from users residing in Spain than from users living in Portugal. In Salamanca, by contrast, reviews from users living in Spain clearly outnumber those from users living in other countries.

The analysis of review frequencies by location and language (Table 1 and Figure 3) shows that the number of reviews in Spanish (4362) is almost twice the number of reviews in English and Portuguese, respectively 2046 and 2230. In turn, as Table 4 illustrates, there is also a wide range in the frequency of reviews by location. Salamanca stands out: one place has only sixteen reviews, while the most reviewed has 2146. In Coimbra, only two places, Biblioteca Joanina and Universidade, have more reviews than the city average for all places (299.2), respectively 1141 and 859. In Salamanca, only three places have more reviews than the city average (564.6). These are Plaza Mayor (2146), Casco Histórico (902), and Catedral Vieja (653).

Natural Language Processing and Text Mining
The application of Natural Language Processing and Text Mining methods to the reviews' qualitative component also revealed interesting results. For example, the frequency count showed that in the 2992 reviews written about Coimbra attractions, the word "Salamanca" was mentioned four times. In turn, in the 5646 reviews written about Salamanca's attractions, the word "Coimbra" was mentioned seven times. Below is an example of these references (in the original language) with gender, city of residence, age (when available), and date of review: Extract 1-"( . . . ) The University of Salamanca is a treasure of the world, not just Spain. For us, it proudly joins the family of Oxford, Cambridge, Sorbonne, Heidelberg, and Coimbra Universities, which we had visited in the past and greatly admired." (from Sudbury, Massachusetts, 02-07-2017) [our emphasis].
As described in OTR studies, Sentiment Analysis shows differences in the OTRs' qualitative components depending on the language used (as in the quantitative component) [45]. Users who write reviews in Portuguese are those who, on average, give the highest quantitative assessment. However, when analyzing the sentiment, we notice that, by two tenths, this is not true for Salamanca, where users who write in Spanish present the highest average. Conversely, the reviews written in English present the lowest average ratings, except for the quantitative assessment of Coimbra (Figure 4). However, these averages should be analyzed with some caution, as shown by [71], because Sentiment Analysis identifies discrepancies between quantitative and qualitative evaluations. See, for example, the review below, in which the two evaluations show conflicting results: Extract 2-"Tenia muchas ganas de fotografiar por la noche el puente romano y la catedral, cuando acudí comenzaba la iluminacion nocturna, una mezcla de luz increible" (which translated to English means "I really wanted to photograph the Roman bridge and the cathedral at night, when I went there the night lighting began, an incredible mixture of light") (resident in Provincia de Ciudad Real, Spain; Catedral Nueva, Salamanca, 15.05.2017. Rating: 1, Sentiment: 5).
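A simple way to surface reviews such as Extract 2 is to compare the two scores directly. The Python sketch below is only an illustration: it assumes both the user rating and the sentiment score are expressed on the same 1 to 5 scale (as the extract suggests), and the `Review` class, the `min_gap` threshold, and the sample values are hypothetical rather than part of the study's pipeline.

```python
from dataclasses import dataclass


@dataclass
class Review:
    review_id: str
    rating: int      # quantitative score given by the user (assumed 1-5)
    sentiment: int   # score derived from the text (assumed rescaled to 1-5)


def flag_discrepancies(reviews, min_gap=3):
    """Return reviews whose rating and sentiment differ by at least `min_gap`."""
    return [r for r in reviews if abs(r.rating - r.sentiment) >= min_gap]


# Hypothetical sample; the first entry mirrors the pattern of Extract 2.
sample = [
    Review("r1", rating=1, sentiment=5),
    Review("r2", rating=5, sentiment=5),
    Review("r3", rating=4, sentiment=3),
]

flagged = flag_discrepancies(sample)
print([r.review_id for r in flagged])
```

Flagged reviews can then be read manually to decide whether the rating was entered by mistake or the sentiment model misread the text.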
The search for words and terms in the three languages related to the UNESCO heritage shows that the word UNESCO is mentioned more often in Coimbra than in Salamanca, but in both cases, unexpectedly, in a very low proportion of the reviews (Table 5). Interestingly, words linked to heritage (e.g., "history", "architecture") are present in several reviews. Text annotation also made it possible to check that the adjectives (lemmas of the adjectives, i.e., words as used in dictionaries) in the reviews do not differ from language to language, nor by city (see Figure 5). In fact, for Coimbra, the two most frequent adjectives in the three languages are semantically equivalent.
The extraction of keywords through the RAKE method also provided some interesting findings. Figure 6 shows how the keywords differ by language and city and how many keywords are related to history and heritage, ranging from various references to architectural styles, names of kings, and places to references to specific centuries. However, the main surprise was the keyword with the highest RAKE index, for Coimbra, in English: "Harry Potter". In fact, the subsequent textual analysis of the reviews shows that the visiting users observe the Biblioteca Joanina, the streets of Coimbra, or the University space from the perspective of the "Harry Potter" universe (Harry Potter is the fictional main character of a series of seven children's and youth novels written in English by the British author J.K. Rowling (1998–2007), which were also adapted for cinema): Extract 3-"Enjoyed our walk through the narrow streets of historical Coimbra to visit the university. ( . . . ) An inspiration for J.K. Rowlings and the Harry Potter novels." (resident in Jerusalem, Israel; Jardim Botânico, Coimbra, 01.08.2017).
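To make the RAKE index more concrete, the following Python sketch implements the classic stopword-delimited formulation of RAKE: candidate phrases are the runs of words between stopwords and punctuation, each word is scored as degree divided by frequency, and a phrase score is the sum of its word scores. This is an assumption about the general technique only; the implementation used in the study may differ (for example, by relying on part-of-speech tags), and the tiny stopword list and example text are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real applications use a full list per language.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "was",
             "we", "our", "me", "us", "it", "by", "with", "on", "at"}


def candidate_phrases(text):
    """Split text into runs of non-stopwords delimited by stopwords/punctuation."""
    phrases = []
    for fragment in re.split(r'[.,;:!?()"\n]+', text.lower()):
        current = []
        for word in re.findall(r"[^\W_]+(?:['-][^\W_]+)*", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}


example = "The library is beautiful. The library reminded me of Harry Potter."
scores = rake_scores(example)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.1f}  {phrase}")
```

Multi-word phrases whose words rarely occur alone tend to receive the highest scores, which is consistent with a proper name such as "Harry Potter" ranking at the top.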

Discussion of Results
The analysis carried out using both quantitative and qualitative methods shows that some locations collected more reviews than others. In Coimbra, reviews from foreigners outnumber those from Portuguese nationals, while in Salamanca, reviews by users residing in Spain and written in Spanish outnumber those by other users. These results suggest that Coimbra receives proportionally more foreign visitors than Salamanca. Two explanatory hypotheses can be put forward: either Coimbra is promoting itself adequately in the foreign market, or it is not promoting itself well within the national market. The inverse could hold for Salamanca: the city either promotes itself adequately among the national public or promotes itself poorly among international visitors. Finally, it appears that most visitors come from European or Latin American countries.
The study of associations can enable DMOs to understand visiting patterns better and implement measures to improve tourists' experiences, such as (1) introducing the visitors of one city to the other city: since we found visitors common to both cities, some of whom actually compare them, one could hypothesize that both cities cater for similar tourist "targets", so cross-marketing could be employed to present each city to visitors of the other; and (2) encouraging visitors to the more popular attractions to check the less popular ones. In this way, DMOs could mitigate the effects of overtourism in the most popular attractions and help visitors get a more wholesome experience.
One of our research questions asked whether being listed as UNESCO heritage was an attraction factor for visitors. We were able to ascertain, using the RAKE method, that the word "UNESCO" or the equivalent phrase "world heritage" seldom occurred; however, the words "architecture" and "history" appeared more frequently. These results seem to indicate that, among the four reasons for applying to the UNESCO list identified by Research Consulting Ltd. & Trends Business Research Ltd. [26] (celebration, SOS designation, brand for marketing/logo, and catalyst for building space), the brand category is clearly not being worked on in either Coimbra or Salamanca. Based on our corpus, we could say that the listing does not seem to be an attraction factor for visits. This result helps to clarify the answer to the question formulated by Remoaldo, Vareiro, Ribeiro, and Marques [72]: "Do the cities declared by UNESCO as World Heritage Sites have an outstanding tourist competitive advantage over the ones not benefiting from such a label?". In conclusion, the UNESCO reputation or brand does not seem to be appropriately explored.
As mentioned above, the analysis revealed the transposition of the Harry Potter fiction universe to the ancient space of Coimbra or, in other words, the evocation of the Harry Potter universe from the immersion in the Coimbra environment, at least for visitors who wrote reviews in English. This phenomenon, common to some works of fiction, could be used by those responsible for promoting the city of Coimbra to market it among tourists.

Conclusions
It is not possible to understand modern heritage tourism without considering the people who 'consume' heritage [73]. Following that logic, this study explored the way people 'consume' the heritage of two UNESCO listed cities using a corpus collected from what people themselves write and post about that same 'consumption'. From a methodological perspective, this study contributes to the tourism literature on heritage studies by demonstrating how a data science approach that uses natural language processing, machine learning, and other methods can be used to explore OTRs in order to gain a clearer understanding of what visitors value and want to find in their experience. Even though our study is not unique in some of its methodological choices (see, for instance, how Rodrigues et al. rely on UGC produced by tourists to identify thermal and spa attractiveness [16]), we believe it may be used as a methodological framework to understand the applicability of various methods to analyze OTRs of UNESCO World Heritage sites. The study showed how the various techniques could complement each other and how they make it possible to establish standardized information on the expectations, perceptions, and appraisals of visitors, who are simultaneously users of TripAdvisor. We also sought to highlight the advantages of applying various methods of analysis to a comparative study and to demonstrate the informational wealth of OTRs as an instrument to (re)position the destination.
Theoretically, we believe this study uncovers some weaknesses in terms of the World Heritage List recognition by tourists or visitors to Salamanca and Coimbra, even though econometric studies show that inclusion on the World Heritage List could have the stimulating effect of promoting tourism (namely in China) [74]. Nonetheless, other studies seem to question the World Heritage List's direct impact on tourism attraction [75]. Indeed, our findings, which show unexpectedly few occurrences of the UNESCO acronym (or synonyms) in the OTRs, demand further studies to explore this link better. This finding could mean that the brand is not living up to its promise [76] and, therefore, destination marketing organizations should focus on enhancing the UNESCO 'brand'. Empirically, it is possible to state that analyzing these particular OTRs about Coimbra and Salamanca revealed the semantic similarity of adjectives across the three different languages and the two cities. However, when applying the RAKE method, the keywords were no longer that similar, and we found that many are related to history and heritage, ranging from various references to architectural styles, names of kings, and places. Additionally, the keyword with the highest RAKE index, for Coimbra, in English was "Harry Potter", which was rather unexpected. This presents interesting challenges in terms of investigating further the language and discourse of OTRs, and provides useful insights to be used by DMOs in Coimbra.
This study is not without limitations. First, it only focuses on two cities. It could be useful to replicate it with other listed cities from different geographic regions. This replication would expand our understanding of what tourists/users want and value, which is particularly relevant to promote cities, leverage economic development, and contribute to strengthening place identity.
Second, UNESCO World Heritage tourists are male and female in equal numbers, are highly schooled, are employed, and travel in small groups (two to five people) [8]. Although not tested in our analysis, these demographics could help explain the precise and relatively sophisticated textual construction of the collected OTRs and the frequent analogies and cultural references to other places, issues that will be explored in the future. We believe we have shown our study's relevance and its potential for future research, namely for exploring differences in OTR discursive and rhetorical patterns. In particular, it underlines the relevance of multilingual studies, which remain underexplored despite the growing importance of intercultural issues in computer-mediated communication [77].