A Regression Tree Approach for Investigating the Impact of High Speed Rail on Tourists’ Choices

This paper contributes to the international literature by applying regression tree methods to the analysis of the expected effects of the High Speed Rail (HSR) project in Italy on the tourism market. This approach, as far as the authors know, has never been applied in this context. Tourism and transport information have been gathered for 99 Italian provinces during the 2006–2016 period. Tree-structured methods have been chosen as an application of regression models in which some explanatory variables are used as covariates to predict the dependent variable values on the basis of some decision rules. This approach establishes a causal effect between dependent and independent variables. The dependent variables chosen are the numbers of Italian and foreign tourists, and the number of overnights spent by Italians and foreigners. Among the independent variables are the presence of HSR, the presence of first-level airport hubs and the number of operating bases of low-cost airlines; among the attractiveness variables are the GDP, the number of attractions in a given province, the presence of the sea, the population and the percentage of unemployment. The main outcome of this study is that HSR affects the tourism market.


Introduction
There are many studies analyzing the factors affecting tourists' destination choice, and Hunt can be considered the pioneer of this approach [1]. A tourist destination can be considered as a product made up of natural resources, infrastructures and services, cultural events, etc. [2]. The attributes "shaping" this product are related to its attractiveness, its facilities and the accessibility provided by the transportation system [3]. Specifically, the attractiveness factors generate flows to a given destination. The tourist facilities are fundamental, since their absence prevents tourists from travelling to enjoy these attractions. In general, the accessibility to given tourist destinations is related to the different transport modes available to reach them [4].
New interventions in the transportation system bring an increase in accessibility, which in turn fosters tourism development [5]. Khadaroo and Seetanah stated that the "provision of suitable transport has transformed dead centers of tourist interest into active and prosperous places attracting multitudes of people" [6]. Alternative transport modes have been changing over the centuries according to the development of technology. Since 1964, with the Shinkansen in Japan, the revolution in the transportation sector has been represented by High Speed Rail (HSR). The focus of this manuscript is to investigate whether HSR can induce changes in tourists' behaviour. Guirao and Campa [7] demonstrated that the high costs of new HSRs require a selection methodology to define which HSR corridors, within a network, should be built first, and the most suitable evaluation tool is the multi-criteria approach. Indeed, they showed that, in any corridor-ranking methodology, and especially in countries with high tourism attractiveness, the impacts of HSR on tourism should be considered. In Table 1, the total number of tourists visiting Italy during the period 2006-2017 has been reported. There are several papers in the international literature investigating the impact of HSR on tourism, following different approaches [9][10][11]. However, in this paper the methodological approach developed is new to this context, as far as the authors know. A panel dataset for 99 Italian provinces, built for the 2006-2016 period, has been collected, and the impact of the HS/HC rail project in Italy on tourism has been studied through the specification of regression trees.
The structure of the paper is as follows. In Section 2, a literature review on the impacts of HSR on tourism is reported. Section 3 describes the methodology. In Section 4, the case study and the dataset are presented. Section 5 presents and discusses the results, while, in Section 6, the conclusions and further perspectives are shown.

HSR and Tourism: Is There a Link?
For Della Corte et al. [12], a tourist destination is represented by accessibility to a given destination, by attraction factors, by the presence of hotels, by the amenities, by the activity of tour operators and by the agencies providing tours. According to Kaul [5], destinations which are characterised by an efficient transportation system, in terms of infrastructure and services, can experience a development of tourism. Several authors have contributed to the analysis of the relationship between HSR and the tourism market [13][14][15][16][17][18][19][20].
It is important to state, from the very beginning, that the contributions present in the literature provide information on tourism statistics that are sometimes aggregated (the sources are commonly represented by the Census); this means that it is not easy to infer the real impact of HSR from them. Indeed, only direct surveys of tourists can actually be considered the right way to get information concerning tourists' chosen travel mode for reaching a given destination. In the following, some international experiences are reported.
Bazin et al. [21] assessed the effects of HSR on urban and business tourism, choosing French case studies. The main outcome was that HSR was chosen for urban tourism, especially for short-stay tourism. Moreover, HSR proved to be cheaper when travelling in groups and is more accessible with respect to other alternative transport modes if the rail station is located in the city centre.
Chen and Haynes [22] demonstrated, for the case study of China, that HSR services significantly affected the tourism market. Kurihara and Wu [23] studied the tourism variation in Japan. Specifically, tourism arrivals increased in cities served by the network.
Delaplace et al. [10] analysed the link between HSR and theme parks, i.e., Disneyland Paris and Futuroscope Parks, served by an HSR station. In the case study of Disneyland, HSR had an impact on tourists' behaviour. On the other hand, for the case study of Futuroscope, HSR was not important.
Guirao and Soler [24] and Campa et al. [25] highlighted a positive relationship in Spain between the increase in tourism outputs and HSR deployment.
Albalate and Fageda [26] and Albalate et al. [27], for the same case study, showed a negative impact of HSR on tourist arrivals and revenues. This behaviour might be attributed to a network design that does not correspond to the riders' needs.
The impact of HSR on the travel behaviour of tourists in Taiwan in relation to time, space and carbon emissions was investigated by Sun and Lin [28]. In general, HSR had a weak effect on travel distance and length of stay, but a 10% reduction in transport carbon emissions through intermodal substitution was registered. For the case study of China, Wang et al. [29] showed that HSR increased tourism-based economic relationships between cities.
There are also some contributions in the current literature which focus on the probability of re-visiting a given destination by HSR. This aspect has been discussed by Seddighi and Theocharous [30], who studied the probability of revisiting Cyprus in terms of socio-demographic and destination characteristics, as well as by Barros and Assaf [31], who concluded that the probability of revisiting Lisbon "increases significantly with accommodation range, events, food quality, expected weather, beach, overall quality, nightlife, reputation, and safety". Delaplace et al. [32] studied the factors that have an impact on destination choice for tourism purposes and the role of HSR systems in affecting the choice to revisit Rome and Paris. Pagliara et al. [33,34] followed the same approach by comparing the factors influencing the choice to revisit Madrid and Paris.
In general, the gap that can be found in these contributions is that the general approach is an econometric one. On the other hand, the methodological approach proposed in this manuscript is based on the regression-tree approach, which has never been applied in this context.

The Methodology
In this paper, regression tree methods have been proposed with the objective of analysing the impacts of HSR projects on tourism. Other case studies have proposed the use of these methods for making inference on tourism, mainly for analysing tourism-based regional economic development [35,36]. The dataset collected contains information both on tourism and transport for 99 Italian provinces during the 2006-2016 period. These data are panel data, i.e., a combination of cross-sectional and time series data, where all information for each province is observed during the period under analysis. In the literature, there are several approaches dealing with these data, like the mixed effects or generalized estimating equation models [37]. Such regression models rely on basic assumptions and pre-defined underlying relationships between dependent and independent variables. If these assumptions are violated, the model could lead to erroneous estimation [38]. The use of tree-structured methods, like regression trees, can be seen as an application of a regression model in which some explanatory variables are used as covariates to predict the dependent variable values on the basis of some decision rules [39][40][41][42]. This approach establishes a causal effect between dependent and independent variables. These tree-structured methods do not require a priori probabilistic knowledge of the phenomena under study, and no assumption is required [43,44]. This can be considered as one of the main advantages of these methods with respect to standard econometric analysis [45]. A tree is a hierarchical and graphical representation of interactions between variables, formed by a finite number of nodes departing from the root node or father node (see Figure 2). The tree building approach is based on the implementation of the Classification And Regression Tree (CART) algorithm proposed by Breiman et al. [46].
The tree is built by a binary recursive splitting algorithm, where each parent node is linked to two child nodes: the left and the right nodes. The child nodes can be classified into internal and terminal nodes. An internal node is connected to its parent node at the top and is itself recursively treated as a parent node, so that the whole process continues for all the subsequent nodes. At the bottom, a terminal node has no child nodes and represents the final result of a combination of decisions or events.
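The node hierarchy described above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `Node`, its fields, the example tree and the convention that the left child receives observations with a predictor value below the cutpoint are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a binary regression tree (hypothetical field names)."""
    prediction: float                 # mean of the dependent variable in the node
    feature: Optional[int] = None     # index of the splitting predictor (None if terminal)
    cutpoint: Optional[float] = None  # cutpoint of the split (None if terminal)
    left: Optional["Node"] = None     # child for observations below the cutpoint
    right: Optional["Node"] = None    # child for observations at or above the cutpoint

    def is_terminal(self) -> bool:
        # A terminal node has no child nodes.
        return self.left is None and self.right is None


def predict(node: Node, x: Sequence[float]) -> float:
    """Follow the decision rules from the root down to a terminal node."""
    while not node.is_terminal():
        node = node.left if x[node.feature] < node.cutpoint else node.right
    return node.prediction


# Hypothetical tree: the root splits on predictor 0 at 0.5; its right child
# is an internal node that splits again on predictor 1 at 10.0.
tree = Node(
    prediction=5.0, feature=0, cutpoint=0.5,
    left=Node(prediction=2.0),
    right=Node(
        prediction=8.0, feature=1, cutpoint=10.0,
        left=Node(prediction=6.0),
        right=Node(prediction=12.0),
    ),
)
```

Each terminal node carries the value returned for every observation whose predictors satisfy the chain of rules leading to it.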
The splitting algorithm is based on increasing the internal homogeneity of the dependent variable Y. In order to perform recursive binary splitting, it is necessary to select the predictor $X_j \in \{X_1, \ldots, X_p\}$ and the cutpoint $c$, i.e., to split the predictor space into the regions $\{X \mid X_j < c\}$ and $\{X \mid X_j \geq c\}$, so that the split leads to the greatest reduction in impurity. Specifically, all predictors $X_1, \ldots, X_p$ and all possible values of the cutpoint $c$ for each of the predictors are considered, and then the predictor and cutpoint are chosen by considering the resulting tree with the lowest impurity [47].
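The exhaustive search over predictors and cutpoints can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, candidate cutpoints are taken as midpoints between consecutive distinct predictor values, and impurity is measured by the sum of squared deviations in the two child nodes (minimising it is equivalent to maximising the reduction in within-node variance). The toy data are invented.

```python
from typing import List, Optional, Sequence, Tuple


def sum_of_squares(y: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def best_split(
    X: List[Sequence[float]], y: Sequence[float]
) -> Optional[Tuple[int, float, float]]:
    """Return (predictor index, cutpoint, child sum of squares) of the best split."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Assumption: candidate cutpoints are midpoints between distinct values.
        for low, high in zip(values, values[1:]):
            c = (low + high) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < c]
            right = [yi for row, yi in zip(X, y) if row[j] >= c]
            total = sum_of_squares(left) + sum_of_squares(right)
            if best is None or total < best[2]:
                best = (j, c, total)
    return best


# Invented toy data: predictor 0 is a dummy, predictor 1 is continuous.
X = [[0, 1.0], [0, 2.0], [1, 1.5], [1, 2.5]]
y = [10.0, 12.0, 30.0, 32.0]
split = best_split(X, y)
```

In this toy example the dummy predictor separates the two groups of responses cleanly, so it is selected with cutpoint 0.5.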
In the regression tree algorithm, the impurity of a node is measured by the Least-Squared Deviation (LSD) R(t), which is the within-node variance of the dependent variable for the node t. It can be expressed as follows:

$$R(t) = \frac{1}{N_t} \sum_{i=1}^{N_t} \left( y_i(t) - \bar{y}(t) \right)^2$$

where $N_t$ is the number of observations in the node t, $y_i(t)$ are the individual values of the dependent variable at node t and $\bar{y}(t)$ is the mean of the dependent variable at node t. Given the impurity function R(t), the split s determines the two subsequent nodes, i.e., the left ($t_L$) and right ($t_R$) nodes. The goodness of a split is measured by the decrease in impurity:

$$\Delta I(s, t) = R(t) - p_L R(t_L) - p_R R(t_R)$$

where $R(t_L)$ is the impurity of the left child node and $R(t_R)$ is the impurity of the right child node generated by the split s, while $p_L$ and $p_R$ are the proportions of units assigned to the left and right child nodes [48]. A criterion has been selected to stop the tree building, i.e., a minimum decrease in impurity equal to 0.01 [49].
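A small worked example may help to fix the quantities involved. The sketch below assumes the usual CART form of the two measures, with R(t) computed as the within-node variance and the decrease in impurity as the parent impurity minus the proportion-weighted child impurities. The function names and the data are hypothetical. The text does not state how the 0.01 threshold is scaled, so comparing it directly with the raw decrease is an assumption.

```python
from typing import Sequence

MIN_DECREASE = 0.01  # stopping threshold quoted in the text (scaling assumed)


def node_impurity(y: Sequence[float]) -> float:
    """Within-node variance R(t): mean squared deviation from the node mean."""
    n = len(y)
    mean = sum(y) / n
    return sum((v - mean) ** 2 for v in y) / n


def impurity_decrease(
    parent: Sequence[float], left: Sequence[float], right: Sequence[float]
) -> float:
    """Decrease in impurity: R(t) - p_L * R(t_L) - p_R * R(t_R)."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return (
        node_impurity(parent)
        - p_left * node_impurity(left)
        - p_right * node_impurity(right)
    )


# Invented values of the dependent variable in a parent node and its children.
parent = [1.0, 2.0, 3.0, 4.0]
left, right = [1.0, 2.0], [3.0, 4.0]

delta = impurity_decrease(parent, left, right)
keep_split = delta >= MIN_DECREASE
```

Here R(t) = 1.25 for the parent and 0.25 for each child, so the decrease is 1.25 - 0.5 × 0.25 - 0.5 × 0.25 = 1.0 and the split would be retained.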
In addition to the analysis of the link between the dependent and independent variables, it has been possible to compute a predictor ranking (also known as variable importance) based on the contribution of the predictors in building up the tree. The ranking considers that an important variable might not appear in any split in the final tree when the tree includes another surrogate variable. The latter is defined when two variables $X'$ and $X''$ are highly correlated and are strong competitors. In this case, if variable $X'$ is first selected in the tree building process, this might prevent variable $X''$ from being selected, masking the influence of the variable itself [50]. If the masking variable is removed, the masked variable could show up in a prominent split in a new tree that is almost as good as the original [51]. The importance score of a given variable $X_j$ is the sum of the improvements in impurity across all the nodes in the tree when it acts as a primary or surrogate splitter, as defined by Breiman et al. [40]:

$$M(X_j) = \sum_{t \in T} \Delta I(s_j, t)$$

where $T$ is the set of nodes of the tree and $s_j$ is the (primary or surrogate) split on $X_j$ at node t. The Variable Importance $VI(X_j)$ is equal to

$$VI(X_j) = \frac{M(X_j)}{\max_{k} M(X_k)} \times 100$$

where $VI(X_j)$ are the importance values, given by $M(X_j)$ divided by the largest importance value $\max_k M(X_k)$ and expressed as percentages. This allows the identification of the masking variable and of non-linear correlation among attributes [52].
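The normalisation of the importance scores can be illustrated with a few lines of code. This is a sketch only: the function name is hypothetical, and the improvement values are invented for the example and are not results from the paper. Each pair records a variable and the improvement in impurity at a node where it acts as a primary or surrogate splitter.

```python
from typing import Dict, Iterable, Tuple


def variable_importance(
    improvements: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Sum improvements per variable and rescale so the largest equals 100."""
    raw: Dict[str, float] = {}
    for name, delta in improvements:
        raw[name] = raw.get(name, 0.0) + delta
    largest = max(raw.values())
    if largest == 0:
        # No variable improved the impurity: all importances are zero.
        return {name: 0.0 for name in raw}
    return {name: 100.0 * m / largest for name, m in raw.items()}


# Invented improvements, one entry per node and variable.
improvements = [
    ("Attraction", 4.0),
    ("HSR", 2.0),
    ("Attraction", 1.0),
    ("GDP", 2.5),
    ("HSR", 0.5),
]
vi = variable_importance(improvements)
```

With these invented values the raw scores are 5.0, 2.5 and 2.5, so the top variable is assigned 100% and the other two 50% each.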

The Case Study
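The dataset deals with information regarding tourist arrivals as well as transport modes for the 99 Italian provinces, observed during the 2006-2016 time period. Therefore, 1089 observations (99 provinces × 11 years) have been collected. The dependent variables considered are the number of Italian tourists, the number of foreign tourists, and the nights spent by Italian and by foreign tourists. The independent variables are listed in the following.

Transport variables
• HSR: dummy variable for the presence of HSR in the province;
• HUB2: dummy variable for the presence of an airport hub in the province;
• LowCost: no. of operating bases of low-cost airlines.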

Attractiveness variables
• GDP: the Gross Domestic Product of the province (Italian Census, ISTAT, www.istat.it);
• Attraction: the no. of activities in a given province (sum of museums, historical sites, etc., information collected through different websites);
• Sea: a dummy variable assuming Value 1 if the province is close to the sea; 0 if otherwise;
• POP: the number of inhabitants in a given province (hundreds of thousands, Census data);
• Unemployment: percentage of unemployed in a given province (Census data).
The descriptive statistics of the variables are reported in Table 2, taking into account that HSR, HUB2 and Sea are dummies, i.e., binary variables. The choice of these independent variables is in line with the literature. Binary variables to describe the transport supply are, for example, reported in the papers of Albalate and Fageda [26], Pagliara et al. [9,11] and Albalate et al. [27]. Moreover, in the paper by Albalate et al. [27], just one variable, i.e., the hotel price index, was introduced by the authors to describe the attraction of a destination.
It is important to highlight that the unavailability of data for the whole period of analysis (i.e., eleven years) represented a limitation of this manuscript, and therefore dummy variables have been used to try to resolve the issue.
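The panel structure described in this section, with every province observed in every year, can be sketched as a province-year index. The province identifiers below are placeholders invented for the illustration and the variable names are hypothetical. Only the counts (99 provinces, years 2006 to 2016) come from the text.

```python
from itertools import product

N_PROVINCES = 99
YEARS = range(2006, 2017)  # 2006 to 2016 inclusive, i.e., 11 years

# Placeholder identifiers; the real dataset would use province names or codes.
province_ids = [f"P{i:02d}" for i in range(1, N_PROVINCES + 1)]

# One row per (province, year) pair: the balanced panel index.
panel_index = list(product(province_ids, YEARS))
n_observations = len(panel_index)
```

The length of the index reproduces the 1089 observations reported above (99 × 11).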

Results and Discussion
In this section, the regression trees representative of the collected data, relating to the analysis period corresponding to the years 2006-2016, will be presented.
In the regression trees, the following dependent variables have been assessed: (1) number of Italian Tourists (Italian Tourist); (2) number of Foreign Tourists (Foreign Tourist); (3) nights spent in tourist installations by Italian Tourists (Overnights_Italian); and (4) nights spent in tourist installations by Foreign Tourists (Overnights_Foreign).
The tree for the Italian Tourist variable, as shown in Figure 3, presents 26 nodes and the first split variable is the HSR one. If the provinces are served by the HSR, they attract a higher number of tourists, who are even more attracted if the number of LowCost companies is high; otherwise, the presence of a HUB2 influences their choice. In the other branch of the tree, in the provinces not served by HSR, the tourist flow is influenced by the presence of a hub of a network carrier, and if there are a low number of low-cost companies, their choice is also influenced by the unemployment rate and the number of inhabitants in the province. On the other hand, if there is not a second level hub, the variable that influences the tourist flow is the presence of attractions, followed by the presence of the sea.
The tree for the Foreign Tourist variable, as shown in Figure 5, presents eight nodes and the first partition variable is represented by the HSR variable. As expected, provinces with HSR attract more tourists and the tourist flow goes to the provinces with more attractions (such as museums, archeological sites, etc.).
For provinces where HSR is not present, tourists choose destinations served by second level hubs (HUB2) and without the presence of the sea.
Considering the importance variable values (see Figure 6), for foreign tourists, the variable Attraction is the most important. The LowCost variable is important as well. Among the transport variables, the most important one is the HSR, indicating that, in those provinces served by HSR, there is a greater flow. Less important, but with an impact on tourist destination choice, is the variable GDP, indicating the Gross Domestic Product of the province.
As shown in Figure 7, the tree representative of the dependent variable Overnights_Italian is positively influenced by the transport variable HSR. If there is an HSR station, tourist flow is influenced by the attractions of the provinces and, where there are fewer attractions for tourists, the latter are mainly influenced by the presence of a second level hub (HUB2). In the other branch of the tree, the presence of a HUB2 strongly influences tourists' choices; the attractions of the provinces also play a fundamental role.
By analysing the importance values (see Figure 8), the most important variable is the one linked to the attractiveness of a given province, considering that an attractive province requires more nights to be spent at the destination. The GDP and LowCost variables are important as well.
Figure 9 shows that the Overnights_Foreign variable is influenced by the presence of HSR. If HSR is present, the most visited provinces are those with high attractions, while, in the other case, the presence of a HUB2 is very important. Foreign tourists choose provinces with a high GDP.
By analyzing the importance values (see Figure 10), the most important variable is the Attraction, followed by the GDP and the LowCost variables. The transport variable that involves a higher flow of foreign tourists is HSR, while the presence of HUB2 is not very significant.
The results obtained in this manuscript are consistent with other methodological approaches, proposed for the same case study of Italy, present in the literature. In the article of Pagliara et al. [9], the impact of HSR on the tourism market was studied with a database containing information both on tourism and transport for 77 Italian provinces, during the 2006-2013 period. Through the specification and estimation of a panel model, it was demonstrated that the effects of HSR on the number of visitors and the number of nights spent at destination are positive in all the provinces served by an HSR line. A different approach was proposed by Pagliara and Mauriello [11], where the impact of HSR on the tourism market was analysed through the specification of Geographically Weighted Regression techniques, embedded within a Poisson model. This approach can measure the relationship between independent and dependent variables with respect to space. The main outcome of the analysis, based on the same number of Italian provinces, i.e., 99, presented in this manuscript and the same time period, i.e., the 2006-2016 one, was that HSR affects tourists' choices of a given destination.

Conclusions and Further Perspectives
HSR systems can have impacts on the tourist areas they serve, thanks to the increased accessibility they bring to the served areas. Indeed, this manuscript has found consistent evidence in favor of a positive relationship between HSR and tourist outcomes. Several approaches are present in the international literature to analyse whether this effect exists, both qualitative and quantitative, as described in Section 2.
In this manuscript, an analysis has been carried out with the aid of a dataset containing information on both tourism and transport for 99 Italian provinces during the 2006-2016 period.
The methodology proposed is the CART analysis, which, to the authors' knowledge, has never been applied in this context before. CART provides both theoretical and applied advantages relative to the parametric models. Indeed, from the theoretical perspective, the advantage of the CART method is that it does not require the specification of the functional form of the model in advance or the assumption of an additive relationship between dependent and independent variables. Another advantage is that the CART analysis can effectively handle collinearity problems. In a parametric model, when a serious correlation between independent variables exists, the variability of the estimated coefficients will be inflated. It follows that an interpretation of the relationship between an independent and dependent variable is difficult to define. In addition, regression tree methods are not sensitive to outliers, since the splitting is based on the sample's proportion within the split ranges and not on the absolute values.
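The point about outliers can be illustrated for a single predictor: because a split only depends on how the observations are ordered and partitioned, replacing the largest predictor value with an extreme one leaves the chosen partition unchanged. The sketch below is a hypothetical illustration with invented data and function names. It concerns extreme values of a predictor only. Extreme values of the dependent variable would still enter the node means.

```python
from typing import FrozenSet, Optional, Sequence


def sum_of_squares(y: Sequence[float]) -> float:
    """Sum of squared deviations from the mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def best_partition(x: Sequence[float], y: Sequence[float]) -> Optional[FrozenSet[int]]:
    """Indices sent to the left child by the split with the lowest child sum of squares."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_left, best_total = None, float("inf")
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # no cutpoint can separate equal predictor values
        left, right = order[:k], order[k:]
        total = sum_of_squares([y[i] for i in left]) + sum_of_squares(
            [y[i] for i in right]
        )
        if total < best_total:
            best_left, best_total = frozenset(left), total
    return best_left


# Invented data: the second predictor vector contains an extreme value.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
x_outlier = [1.0, 2.0, 3.0, 4.0, 5000.0]
y = [10.0, 11.0, 12.0, 30.0, 31.0]

same_partition = best_partition(x, y) == best_partition(x_outlier, y)
```

Both predictor vectors lead to the same partition (the first three observations on the left), which is the sense in which the split depends on the ordering of the sample rather than on absolute values.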
From the applied perspective, the regression tree methods are very intuitive and easy to explain. Moreover, they have the advantage of giving each variable the chance to appear in different contexts with different covariates, and thus better reflect its potential impact on the dependent variable. However, unlike a linear regression model, a variable in the CART algorithm can be considered highly important even if it never appears as a node splitter.
Further research is required on the use of HSR variables, which should describe the connectivity and territorial distribution of the HSR network, and the service conditions offered by the operating companies (e.g., fares, timetables, frequency). Specifically, considering that one of the limitations of the current literature is represented by aggregate data on tourism [53,54], the authors suggest employing ad hoc surveys in order to interview tourists directly and ask them which travel mode they chose to reach a given destination for their holidays. More disaggregate analyses should be represented in the international literature to fill this gap.
Moreover, the application of the same methodology to other case studies will be taken into account in order to make a comparison.
Author Contributions: F.P. has supervised, written and edited the whole paper. F.M. has collected the data set and developed the tree-methodology. L.R. has contributed to the development of Section 5. All authors have read and agreed to the published version of the manuscript.