1. Introduction
Travel behavioral pattern analysis is important for the planning and management of tourism destinations and attractions, allowing managers to more effectively develop strategies, map out travel routes, recommend products and experiences, and manage visitor impacts [1]. Travel blogs on social media are an excellent information source for analyzing tourist movements, activities, preferences, and satisfaction levels [2]. However, these data are seldom applied in scenic area planning in China. There is a tendency to focus only on entry tickets sold, revenue generation, and volumes of tourists, rather than on gathering and analyzing robust data on travel patterns and tourist behavior. Tourism planning in most Chinese scenic areas focuses on the levels of financial investment and GDP increases, but generally ignores tourist services and smart management [3]. The Jiuzhai Valley National Park in Sichuan was the first scenic area with a smart management system in China [1,4,5]. However, despite being an innovator, it was forced to launch a tourist flow forecasting system in 2014 to prevent a repeat of an overcrowding crisis during China’s Golden Week in 2013 [6]. Even very well-known attractions in China such as Huashan lack basic data, including detailed statistics on the origins of tourists. Smart tourism strategies based on big-data analysis can help to address these information deficiencies.
Huashan (simplified Chinese: 华山) in Shaanxi Province is a popular scenic area and was selected as the case study for this research. Huashan is a mountain situated in Huayin in the Weinan region of Shaanxi Province, 120 km from Xi’an [7]. It is located near the southeast corner of the Ordos Loop of the Yellow River Basin, south of the Wei River Valley, at the eastern end of the Qin Mountains, in southern Shaanxi Province. It is the westernmost of the Five Great Mountains of China, and has a long history of religious significance. Huashan has five main peaks, the highest being the South Peak at 2155 m (7070 feet). Huashan can be classified as a mature and world-class tourist destination [8,9]. The number of tourists visiting Huashan has been growing steadily; for example, in 2016, total tourist arrivals were 26.2 million and ticket income was RMB 339 million, representing increases of 5% and 8%, respectively, over 2015 [10]. During the field study at Huashan in May 2016, the Huashan Tourism Group, although a leader in Weinan’s tourism sector, could not provide detailed statistical data on tourists other than tourist arrivals based on ticket counts and information provided by local travel agents and hotels in Weinan. Huashan’s smart management system has existed since 2014 [11]; however, its focus is on providing online travel information and generating e-commerce. The system lacks fine-grained statistics on tourists’ behavior and therefore cannot provide data support for intelligent services and further planning. Due to the steep terrain of Huashan, overcrowding during peak attendance periods leads to falls and trampling. Huashan needs to invest great care, time, and effort to ensure the safety of its visitors. As with most of China’s state-run scenic areas, Huashan does not feel compelled to significantly enhance visitor services while ticket revenues and profits are high. Most critically, the administration team at Huashan lacks research on tourist satisfaction. The absence of these data constrains intelligent management, marketing, and the sustainable development of Huashan and the surrounding region of Weinan. Paradoxically, due to the popularity of Huashan among tourists, there are many travel blogs on Huashan in Chinese social media, which are waiting to be mined to profile tourist behavior patterns and satisfaction levels. Recognizing the potential of user-generated content, travel blogs uploaded by Huashan tourists were analyzed to document travel movements, site linkages, and satisfaction levels. The principal research questions were:
(1) How do Huashan visitors describe their travel experiences in blogs?
(2) What sites are visited within the Huashan scenic area?
(3) What are the patterns of movement within Huashan and adjoining destinations?
(4) Are people satisfied with their experiences at Huashan? If tourists are dissatisfied, what are the reasons?
(5) What are the geographic origins of Huashan tourists?
(6) What are the monthly distributions of visits, expenditures, and lengths of stay for visitors to Huashan?
Answering these questions by analyzing travel blogs for Huashan is potentially a smart tourism solution leading to more effective scenic area planning and management. It may also contribute ideas and solutions to enhancing the sustainability of the scenic area.
3. Methods
3.1. Data Collection
There is a great quantity of user-generated content (UGC) on mainstream Chinese online travel websites such as Mafengwo, Baidu Tourism, and Ctrip. The data include not only text and photos in travel blogs, but also tagged data such as travel dates, travel expenses, lengths of stay, associated destinations, and author residences. These data are gradually becoming easier to obtain as the travel websites steadily improve their data structures. Several commercial software programs for web information acquisition are available, such as GooSeeker, Enthone, and the Locomotive and the Octopus web crawlers. These programs generally have the advantages of rapid iteration and ease of operation, making UGC acquisition from travel websites more convenient.
The Octopus web crawler tool was used to capture data from the Travel Guide Channels of Mafengwo and Ctrip on 20 May 2016. Two new tasks were created in the capture software, one for each website, and a complete capture process was established. First, all the lists of travel blogs related to Huashan were obtained by searching the home pages of the Travel Guide Channels using “Huashan” as the keyword. Next, a circular crawl list was created to capture the detail page of each travel blog. Then, in each detail page, different extraction positions were set up according to the page structure to obtain the corresponding contents, such as title, full text, release time, and tourist behavior. Finally, with the automatic page-turning function of the software, all relevant travel materials were obtained.
A total of 1468 travel blogs (over 840,000 words in Chinese) were captured. Among them, 768 (over 58,000 words) were retrieved from Mafengwo, and 700 (over 265,000 words) were retrieved from Ctrip.
3.2. Data Cleaning
The data were saved in a structured format: the basic trip elements (blog title, author, and full text) and the tourist behavioral information (travel dates, travel expenditures, lengths of stay, other destinations visited, and author residences) were imported into an Excel file to form a database of travel blogs. Data missing basic information (3977 articles with more than 1,900,000 words), travel website template data extracted by the regular expression function of the software (3341 articles, >2,700,000 words), and advertising text data (1305 articles, >800,000 words) were deleted. The full text of each travel blog was converted to plain text and the contents were sorted by sentence. Duplicate and blank content (>20,000 words) and short articles of 10 characters or fewer or with meaningless content such as “I am here!” or “This picture is beautiful...” (>50,000 words in total) were removed. After sorting and screening, a total of 1080 high-quality Huashan travel blogs (>700,000 words) were obtained. Among them, 549 articles were from Mafengwo (>439,000 words) and 531 were from Ctrip (>265,000 words).
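The screening rules described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study, which relied on the crawler software and Excel; the field names (`title`, `author`, `text`) and the function `clean_blogs` are hypothetical, and only three of the rules are shown (missing basic information, texts of 10 characters or fewer, and exact duplicates).

```python
def clean_blogs(records, min_chars=11):
    """Drop incomplete, blank, very short, and duplicate blog records.

    `records` is a list of dicts with hypothetical keys
    "title", "author", and "text".
    """
    required = ("title", "author", "text")
    seen = set()
    kept = []
    for rec in records:
        # Skip records that are missing any basic field (or have it blank).
        if any(not (rec.get(field) or "").strip() for field in required):
            continue
        text = rec["text"].strip()
        # Skip texts of 10 characters or fewer.
        if len(text) < min_chars:
            continue
        # Skip exact duplicates of a text that was already kept.
        if text in seen:
            continue
        seen.add(text)
        kept.append({**rec, "text": text})
    return kept


sample = [
    {"title": "Huashan in two days", "author": "a",
     "text": "We climbed the West Peak at dawn and watched the sunrise."},
    {"title": "Huashan in two days", "author": "a",
     "text": "We climbed the West Peak at dawn and watched the sunrise."},
    {"title": "Short", "author": "b", "text": "I am here!"},
    {"title": "", "author": "c", "text": "A long enough text but without a title."},
    {"title": "Blank", "author": "d", "text": "   "},
]
cleaned = clean_blogs(sample)
print(len(cleaned), "of", len(sample), "records kept")
```

Template and advertising text would need site-specific patterns and are therefore left out of this sketch.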
3.3. Data Analysis
To address the first four research questions, content analysis of blogs was conducted. The semantic analysis of blog contents applied ROST CM, NetDraw, and other tools for word segmentation, frequency statistics, and semantic structure drawings. A customized lexical pool was created based on the unique vocabulary associated with Huashan, and then integrated with the built-in Chinese word library of ROST CM. The customized pool included words such as “Huashan (华山),” “West Peak (西峰),” “cliff (绝壁),” “sunrise (日出),” “plank walk (栈道),” “Weinan (渭南),” “lamb liver soup (杂肝泡),” and “spicy Chinese food (香椿辣子).” Figure 1 shows the plank walk at the cliff near the South Peak of Huashan.
Next, word segmentation was performed on the full-text content of the travel blogs. Meaningless words were filtered out and word frequency statistics were calculated. Using the highest-frequency words as feature words, the Word Extraction feature was applied to each sentence line of the travel blogs. A co-occurrence matrix was derived by calculating the total co-occurrence frequencies of all the feature words. This matrix was visualized as a topological graph using NetDraw to represent the semantic structure.
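As an illustration of the co-occurrence step, the following Python sketch counts how often pairs of feature words appear in the same sentence line. It is a simplified stand-in for the ROST CM and NetDraw workflow, not a reproduction of it: the function name `cooccurrence` is hypothetical, the input is assumed to be already segmented, and English glosses are used in place of the segmented Chinese words.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(sentences, feature_words):
    """Count sentence-level co-occurrences of feature-word pairs.

    `sentences` is a list of token lists (already segmented).
    Each pair is stored as an alphabetically sorted tuple.
    """
    features = set(feature_words)
    pairs = Counter()
    for tokens in sentences:
        # Feature words present in this sentence, each counted once.
        present = sorted(features.intersection(tokens))
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs


# English glosses standing in for segmented Chinese tokens (assumption).
sentences = [
    ["Huashan", "West Peak", "sunrise"],
    ["Huashan", "plank walk", "cliff"],
    ["West Peak", "sunrise", "very", "beautiful"],
]
features = ["Huashan", "West Peak", "sunrise", "plank walk", "cliff"]
counts = cooccurrence(sentences, features)

for pair, n in counts.most_common(3):
    print(pair, n)
```

The resulting pair counts correspond to the cells of a symmetric co-occurrence matrix and could be exported as an edge list for a network drawing tool.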
The emotional analysis of visitor satisfaction and dissatisfaction assessed inclinations within blog text using an emotional word library. After word segmentation, the text was separated into lines according to ending punctuation marks such as periods, question marks, exclamation marks, and ellipses. The researchers ensured that each line expressed an independent and complete meaning. Next, a dictionary of Chinese commendatory and derogatory terms written by Professor Li Jun of Tsinghua University and sentiment words from the China National Knowledge Infrastructure were selected as the basis for the emotional analysis. Common negative Chinese words were used as negative emotional expressions, and common Chinese adverbs represented the emotional judgments. The words in each line of the travel blogs were compared to those in this emotional lexicon. Emotional indications were also judged according to the multiple negation rules of Chinese language habits. A score was assigned according to the degree of emotion expressed by the adverbs. Positive and negative scores indicated positive and negative emotions, respectively; scores of zero were neutral. The higher the absolute score, the greater the degree of emotion being expressed by the tourist.
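The scoring logic can be sketched in Python as follows. This is a minimal lexicon-based example under stated assumptions, not the lexicon or weighting used in the study: the word lists are small English stand-ins, the adverb weights (2.0 and 0.5) are arbitrary, and the function names are hypothetical. An odd number of negations before a sentiment word reverses its polarity, and an even number leaves it unchanged.

```python
import re

# Toy lexicon (assumption): English stand-ins for the Chinese word lists.
POSITIVE = {"beautiful", "spectacular"}
NEGATIVE = {"crowded", "dirty"}
NEGATIONS = {"not", "never"}
DEGREE = {"very": 2.0, "slightly": 0.5}  # arbitrary illustrative weights


def split_sentences(text):
    """Split text into lines at sentence-ending punctuation marks."""
    parts = re.split(r"[.!?。！？…]+", text)
    return [p.strip() for p in parts if p.strip()]


def score_sentence(tokens):
    """Sum the signed, adverb-weighted scores of sentiment words in one line."""
    score = 0.0
    negations = 0
    weight = 1.0
    for tok in tokens:
        if tok in NEGATIONS:
            negations += 1
        elif tok in DEGREE:
            weight *= DEGREE[tok]
        elif tok in POSITIVE or tok in NEGATIVE:
            polarity = 1.0 if tok in POSITIVE else -1.0
            # An odd number of preceding negations flips the polarity.
            if negations % 2 == 1:
                polarity = -polarity
            score += polarity * weight
            # Modifiers apply only to the next sentiment word.
            negations = 0
            weight = 1.0
    return score


text = ("The view is very beautiful. The path is not crowded! "
        "The toilets are dirty. We took the cable car.")
scores = [score_sentence(s.lower().split()) for s in split_sentences(text)]
print(scores)
```

In this sketch a score of zero marks a neutral line, and larger absolute values mark stronger expressed emotion, mirroring the interpretation given above.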
The data analysis also produced descriptive statistics on tourist characteristics. Microsoft Office Excel was used to classify, aggregate, cross-analyze, and chart the structured tag data, including tourist origin regions, places visited within the Huashan scenic area, travel dates and expenditures, lengths of stay, and visits to adjoining destinations. These results addressed research questions 5 and 6.
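The same kind of aggregation can be expressed in a few lines of Python. The sketch below is an alternative to the Excel workflow reported here, not a description of it; the record fields (`origin`, `travel_date`, `expense`, `days`) and the sample values are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_visits(records):
    """Count blog-reported visits per calendar month (1-12)."""
    return Counter(rec["travel_date"].month for rec in records)


def mean_by_origin(records, field):
    """Average a numeric tag field per origin, ignoring missing values."""
    totals = defaultdict(lambda: [0.0, 0])  # origin -> [sum, count]
    for rec in records:
        value = rec.get(field)
        if value is None:
            continue
        totals[rec["origin"]][0] += value
        totals[rec["origin"]][1] += 1
    return {origin: s / n for origin, (s, n) in totals.items()}


# Made-up records for illustration only.
records = [
    {"origin": "Beijing", "travel_date": date(2015, 10, 2), "expense": 1500, "days": 3},
    {"origin": "Beijing", "travel_date": date(2015, 10, 5), "expense": 2500, "days": 2},
    {"origin": "Shanghai", "travel_date": date(2016, 5, 1), "expense": None, "days": 4},
]

print(monthly_visits(records))
print(mean_by_origin(records, "expense"))
print(mean_by_origin(records, "days"))
```

Records with a missing tag are skipped for that field only, so a blog without a reported expenditure still contributes to the monthly and length-of-stay summaries.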
5. Conclusions, Management Implications, and Research Limitations
This research is among the first to data-mine scenic area travel blogs by incorporating semantic analysis along with GIS visualizations. It demonstrates the value of this user-generated content for market and satisfaction analysis of scenic area attractions. It is an exploratory analysis of travel blog data about scenic area attractions and there is considerable scope for future studies. Suggestions include analyzing the photographic content of travel blogs; conducting preference analyses among different tourist market segments; and cross-validating the results with data from traditional research methods.
The results show that the tourist experience at Huashan is based on climbing and especially associated with the iconic “plank walk.” Xi’an and Huashan are linked as destinations in the minds and actions of tourists. Specifically, downtown Xi’an, Terracotta Warriors, Huaqing Pool, and Yan’an are often grouped with Huashan in multi-destination trips.
The multi-destination tendency of Huashan tourists underlines the potential for cooperative marketing by the Huashan Management Committee along with the neighboring provinces of Henan, Qinghai, and Gansu. The other closest sites and attractions within Weinan did not appear in multi-destination patterns, which suggests that Huashan is overshadowing neighbors through its much greater destination image and market popularity. The Huashan Management Committee must, therefore, strengthen its role as a tourism development agent for the Weinan region. Greater attention must be focused on regional tourism development and marketing, integrating the tourism resources in Eastern Shaanxi and along the Yellow River.
There is a significant level of dissatisfaction with the facilities, services, and operational management of Huashan, which requires immediate attention. Overcrowding and littering are already serious issues, and will worsen as tourist numbers continue to increase. The sustainability of the Huashan experience is under threat. Visitor monitoring and management are insufficient at the current time; however, smart data-gathering and analysis such as demonstrated in this research can help point to solutions that will improve resource and experience sustainability.
Many attraction administration teams in China still have a narrow “ticket revenue and GDP” mindset and need to broaden their perspectives to operate more professionally as destination managers while assuring the sustainability of precious natural and cultural resources. The Huashan Management Committee should gather and use contemporary information sources, including smartphone ‘footprint’ data, to obtain real-time, spatial data on tourist and personnel movements within the scenic area that impact on the natural resources and environment, traffic flows and convenience of navigation, and visitor safety, experiences, and enjoyment. Managers should be accessing real-time data from big-data centers and cloud computing platforms, as well as analyzing tourist preferences and requirements.
As Gretzel et al. (2015) claimed, the lifeblood of smart tourism is big data, and the final purpose of smart tourism planning is extracting intelligence from big data [41]. Smart scenic area management will be assisted by technological approaches to gathering, analyzing, and interpreting big data [41,42], along with taking care of the human side by providing the types and quality of experiences that visitors are seeking [3,42]. This research verified that the results from travel blog data could help reveal tourists’ opinions on services offered [42] at the area level, although there is a risk of bias by under-representing Huashan visitors who do not post online. Through the development of a smart scenic area system, the administration will be able to monitor tourist flow distribution, traffic conditions, and service facility use in real time. Timely diversion measures can be adopted to ensure the safety, comfort, and enjoyment of tourists. Moreover, service and facility quality must be continuously evaluated and improved based on visitor survey results and observation of facility use and service encounters. Capacity measurement of the most popular sites needs immediate attention as overcrowding is spoiling the tourist experience at Huashan.
It is recognized that this is only one example of a famous scenic area in China and the results may not be generalizable to other similar destinations in China, let alone to other countries. However, the research and its analysis can be helpful to protected area managers for smart destination management and promoting sustainability. The combination of qualitative and quantitative techniques applied to a scenic area using traveler blogs is relatively uncommon. It has the potential to provide protected area managers with visitor monitoring and management data that can enhance resource sustainability and visitor satisfaction.
There are some limitations to this research that must be recognized. The research data were all from social media sources and there is a danger that they may be biased in under-representing Huashan visitors who do not post online. Additionally, all tourists were treated alike, and differences in demographics, travel group composition, and arrangements (e.g., independent vs. group tours) were not investigated. It is very important to stress that big data processing methods should be combined with other approaches, rather than being considered an independent method.