Are Combined Tourism Forecasts Better at Minimizing Forecasting Errors?

This study, which was contracted by the European Commission and is geared towards easy replicability by practitioners, compares the accuracy of individual and combined approaches to forecasting tourism demand for the total European Union. The evaluation of the forecasting accuracies was performed recursively (i.e., based on expanding estimation windows) for eight quarterly periods spanning two years in order to check the stability of the outcomes during a changing macroeconomic environment. The study sample includes Eurostat data from January 2005 until August 2017, and out-of-sample forecasts were calculated for the last two years for three and six months ahead. The analysis of the out-of-sample forecasts for arrivals and overnights showed that forecast combinations taking the historical forecasting performance of individual approaches such as Autoregressive Integrated Moving Average (ARIMA) models, REGARIMA models with different trend variables, and Error Trend Seasonal (ETS) models into account deliver the best results.


Introduction
The perishable nature of tourism products and services such as hotel overnights, airplane seats, or restaurant tables makes forecasting an important prerequisite for setting efficient strategies to ensure business success. The special characteristics of tourism products and services such as perishability, intangibility, and consumption at the point of service delivery, external factors such as natural and man-made disasters, as well as unsteadiness of human nature make forecasting an important issue for international government bodies, national governments, academics, and practitioners alike.
In the past decades, many studies have dealt with the challenge of improving tourism demand forecasting accuracy, yet all these research efforts have led merely to the conclusion that no single forecasting method outperforms all others in all situations [1]. Furthermore, discussions of complexity have become increasingly relevant in the academic literature aiming to improve forecasting accuracies. Green and Armstrong [2] note that the trend to develop increasingly complex approaches has a long history, yet is at odds with scientific principles that advocate simplicity. An alternative to using complex techniques to improve forecasting accuracy would be to combine the forecasts of individual forecasting models with the help of various combination techniques, as it has been shown that combined methods minimize the risk of extreme inaccuracy by "averaging out" the weaknesses of single models [3]. Forecast combinations are also capable of introducing adjustments and additional information balancing out measurement errors, which could negatively affect forecasting power [4].
Despite the many tourism studies carried out to date, the application of forecast combination techniques as an alternative to ever more complex individual models remains rare. In contrast to other disciplines, research into combination methodologies for tourism demand forecasting has a short history, which started only in the early 1980s. Research into combining forecast methodologies was stimulated significantly earlier in various other economic and business fields by the seminal work of Bates and Granger [5]. They examined the performance of combining two sets of forecasts of airline passenger data, whereby the weights of the individual forecasts were calculated based on the historic predictive performance of each individual approach, and found that the combined forecasts showed lower errors than the individual forecasts. Only 20 years after the 1969 Bates and Granger paper, Clemen [6] summarized the intensive work that had been done around the topic of forecast combinations in the interim, and delivered an encompassing literature survey about these activities.
The relatively few studies about forecast combinations in tourism have, in general, delivered the outcome that the combined forecast outperformed the forecasts generated by individual models [3,[7][8][9][10][11][12][13]. Our work compares forecasting accuracies for eight quarterly report periods spanning the period from March 2015 until August 2017 in order to check the stability of the results during a changing macroeconomic environment. In other words, the forecast evaluation exercise is not only carried out once for different forecasting horizons but eight times recursively (i.e., based on expanding estimation windows), which corresponds to a "natural" practitioner's situation.
To meet these different challenges, we developed forecasting models for arrivals and overnights and used Eurostat data according to our commissioned study for the European Union as a whole. These models were tested across eight different quarterly periods for their stability and accuracy. In addition, we calculated combined forecasts based on the forecasts produced by the various individual forecasting models, and assessed the accuracies of these combined forecasts. Therefore, the objective of this study was to analyze whether combined forecasts of selected models are able to outperform the forecasts generated by the individual models in terms of forecasting accuracy.
Forecasting for the European Union as a whole for the indicated period and predicting three and six months ahead, as well as using both single forecasting models and forecast combination techniques that are easily replicable by practitioners were requirements by the European Commission, which contracted the present study [14]. Tourism plays a major economic role in the European Union: 13.6 million people (9.5% of all employees in the non-financial sector) were employed by 2.4 million tourism businesses (around 10% of all non-financial businesses) in 2016 [15]. In the same year, the tourism industry accounted for 3.9% of turnover and 5.8% of value added in the non-financial sector.
The remainder of this paper is organized as follows: after the literature review (Section 2), we discuss the advantages of forecast combinations and the most-used combination methods (Section 3). In the methodological section (Section 4), we present the chosen modeling approaches, the accuracy measures and their application, as well as the data used. The subsequent section provides and discusses the forecasting performance of the different approaches (Section 5). Conclusions and recommendations form the final section of the paper (Section 6).

Related Literature
Forecasting competitions for time series already have a history of about 50 years. According to Hyndman [16], the earliest scientific study of time series forecasting accuracy, the Nottingham study, was done by David Reid [17]. Paul Newbold and Clive Granger took the next step by conducting a study of forecasting accuracy involving 106 time series [16,18]: one important result of their study was that forecast combination improves forecasting accuracy. Comparing forecasts became fashionable and, over the years, "forecasting competition" has become an important term in the forecasting literature.
The first big forecasting competition took place in 1982 and was organized by Spyros Makridakis and Michele Hibon. For this competition, known in the forecasting literature as the "M(akridakis) competition", anyone could submit forecasts related to 1001 time series taken from demography, industry, and economics [16,19]. The following M-2 competition was organized in collaboration with four companies, this time using only 29 time series and with the main purpose of simulating real-world forecasting. A peculiarity of this M-2 competition was that forecasters were allowed to use personal judgements, to ask questions about the data, and to revise their previous forecasts for the next forecast [20]. The succeeding M-3 competition had the objectives of replicating and extending the features of the previous competitions, with more methods, more researchers, and more (i.e., 3003) time series [21]. The most recent competition that had been completed while this study was being written, M-4, ran from January to May 2018, used 100,000 time series, and considered all major forecasting methods, including those based on Artificial Intelligence as well as traditional statistical ones [22]. Informing the major objective of our paper, a stable result across all the M competitions was that combined approaches, on average, outperformed individual forecasts [17][18][19][20][21][22].
Other competitions have also been organized in parallel to the M competitions. Mathematicians and physicists interested in forecasting ran their own competition at the Santa Fe Institute, beginning in January 1992 [23]. Other examples of forecasting competitions include those on the application of neural networks [24] and the global energy forecasting competitions [25,26].
In tourism, research into combination approaches and their efficacy started significantly later than in other disciplines. One of the first studies about forecast combinations in tourism was that by Fritz et al. [10] about the combination of time series and econometric forecasts. Their paper presents parsimonious methods of improving forecasting accuracy by combining various forecasting techniques. The Box-Jenkins stochastic time-series method was combined with a traditional econometric technique to forecast airline visitors to the State of Florida. Some years later, Calantone et al. [8] confirmed the results of Fritz et al. [10] and showed, also for the State of Florida, that forecasts of tourist arrivals obtained by forecast combination were more accurate than forecasts based on individual approaches. Shen et al. [3] tested the accuracy of forecast combinations against the forecasting results of seven different techniques over different forecasting horizons and demonstrated that combinations were superior to the best of the individual forecasts. Song et al. [27] shed further light on these results by showing that combined forecasts may be more beneficial for long-term forecasting. Shen et al. [28] compared six different combination methods and found that those that consider the historical performance of individual forecasts perform better than simple uniform average methods. In contrast, Gasmi [11] demonstrated that, of three combination techniques, the Granger-Ramanathan regression method [29] delivered superior results compared to the simple uniform average technique and the Bates-Granger variance-covariance technique [5].
Andrawis et al. [7] suggested combining forecasts based on diverse information using different time aggregations (e.g., monthly and annual data). In comparing several forecast combination techniques, they show that the approach using forecasts based on time series with diverse time aggregations outperformed the combined individual forecasts based on time series with the same time structure. For improving tourism forecasts, Cang [30] proposed a non-linear combination method using multilayer perception neural networks, which can map the non-linear relationship between inputs and outputs.
On the other hand, a minority of studies has shown that combined forecasts do not always outperform the best individual forecasts, although they are almost certain to outperform the worst individual forecasts [27]. Furthermore, Song et al. [27] stated that combined methods outperform the best single forecast in fewer than 50% of cases on average. A few years later, Song et al. [31] similarly stated that, according to their results, forecast combination improved forecasting performance in the tourism context in just over 50% of all cases compared with the most accurate single prediction. In a similar vein, Gunter and Önder [12] evaluated combined forecasts based on Bates-Granger weights, on multiple forecast encompassing tests, as well as on a combination of the two approaches [32]. The authors applied these forecast combination techniques to Google Analytics indicators used as leading indicators for forecasting tourist arrivals to Vienna. However, these quite complex techniques only performed well for longer forecasting horizons.
To the best of the authors' knowledge, only one true forecasting competition focused on tourism time series data has been held to date [33]. The data set included 366 monthly, 427 quarterly, and 518 annual time series, all supplied by either tourism bodies or academics who had used them in previous tourism forecasting studies. The forecasting methods implemented in the competition were univariate and multivariate time series approaches, and econometric models. Surprisingly, however, this competition did not evaluate the accuracies of combined forecasts compared to individual forecasts.

Advantages of Forecast Combination
Why do combined forecasts perform better than individual forecasts in many contexts? Bates and Granger [5] stated that the simple portfolio diversification argument justifies the idea of combining forecasts. Forecast combinations offer diversification gains that make it efficient to combine individual forecasts rather than taking forecasts from just one single model. The information set underlying the individual forecasts is often unobservable to the forecast user, potentially because it comprises private information. Differences between the subjective judgements of various forecasters could therefore reflect differences in their respective information sets. In this situation, it is not possible to pool the underlying information sets and construct a superior model that encompasses all of the underlying forecasting models. On the other hand, the higher the degree of overlap in the information sets used to produce the underlying forecasts, the less useful a combination of forecasts is likely to be [34]. Furthermore, when forecast users have access to the full information set used to construct the individual forecasts, combinations are sub-optimal and it might be better to recommend finding a superior single model [35,36].
A second reason for using forecast combinations is that individual forecasts are differently influenced by structural breaks caused, for example, by institutional change or technological developments. Some models may adapt fast and only be affected by the structural break for a short time, while other models have parameters with slower adaption speeds. Since it is typically difficult to detect structural breaks in real time, it is plausible that, on average-i.e., across periods with varying degrees of stability-combinations of forecasts from models with different degrees of adaptability will outperform forecasts from individual models [37]. Similarly, Stock and Watson [38] report that in cases of structural breaks, the performance of combined forecasts tends to be far more stable than that of individual forecasts.
Third, forecast combination could be viewed in the sense that additional forecasts act like intercept corrections (ICs) relative to a baseline forecast. ICs can improve forecasting not only if there are structural breaks, but also if there are deterministic misspecifications [39].
Fourth, pooling of forecasts can also be understood as a shrinkage estimation. According to this approach, the unknown future value is viewed as a "meta-parameter" of which all the individual forecasts are estimates [39]. In these cases, averaging may improve the estimates.
Often, we also measure the wrong things. Demand data are rarely, if ever, available: thus, instead of measuring demand, we measure supply data (e.g., in periods with over-utilization of production capacities). However, it is obvious that such proxies of apparent demand introduce systematic biases in measuring real demand and therefore increase forecasting errors [4]. Averaging forecasts, in turn, would balance out these potential errors. Similar conclusions can be drawn for measurement errors and unknown misspecifications.
Moreover, statistical models assume that patterns and relationships remain constant. This assumption does not always hold, especially in the real world, where events, actions, or fashions bring systematic changes and therefore introduce non-random errors in forecasting. Combining forecasts can help to increase accuracy in such settings. Combining is also expected to be useful when experts are uncertain of which method to choose. This may be because we encounter novel situations (e.g., Brexit, the COVID-19 pandemic, stock market crashes, etc.) or have to make forecasts for a longer time horizon.
A further argument for combining forecasts is that the underlying forecasts may be based on different loss functions [40]: let us assume there are two forecasters, both have the same information set for forecasting a specific variable; however, forecaster #1 dislikes large negative forecasting errors, while forecaster #2 dislikes large positive forecasting errors. As a consequence, forecaster #1 will under-predict, while forecaster #2 will over-predict. If the bias is not constant over time, a forecast user with a rather symmetric loss function would find a combination of these two forecasts better than following the individual ones.
While forecast combination has advantages, there are also several arguments against it. Following the literature, forecast combinations are at a disadvantage compared with single forecasting models because they produce parameter estimation errors in cases where the weights to combine the different forecasts need to be estimated [40]. This important consideration of avoiding errors in estimating the weights for the forecast combination has led simple uniform weighting methods to dominate more complex combination methods in mainstream scientific practice. Their important advantage is that the weights are known and therefore do not have to be estimated: this plays a role if there is little evidence on the performance of the individual forecasts or if the parameters of the forecasting model are time-varying. Furthermore, in many situations, a simple uniform average of forecasts will result in a significant reduction in variance and bias through averaging out the individual biases [40,41]. In the literature, the most used simple combination approaches are the Simple Average Combination (SAC) method and the VAriance-COvariance (VACO) method [3,5].
The SAC method assigns equal weights to each of the individual forecasts instead of using any optimal weights to minimize the variance of the combined forecasts. Although forecast combinations with equal weights could be biased, they might contribute to the reduction of the forecasting error as these weights are not influenced by other errors accruing from the estimation of optimal weights [3,42]. According to Palm and Zellner [41], the SAC method has the following advantages. When there is little evidence on the performance of the individual forecasts, it is an important advantage that the weights are known and therefore, no estimation is necessary. Furthermore, in many situations, the application of the SAC method contributes to the reduction of the variance and the bias through averaging out individual biases. Another advantage of the SAC method is its avoidance of sampling errors and model uncertainty in estimating optimal weights.
According to the VACO method, past errors of each individual forecast are used to determine the weights in forming the combined forecasts [5,6]. Bates and Granger [5] suggest assigning higher weights to good forecasts (low errors) and lower weights to poor ones (high errors).

Forecasting Tourism Demand for the European Union
The objective of the commissioned study, based on the research period 2005-2017, was to find the best model for forecasting arrivals and overnights for the total European Union in the short term. The competing models were required to have a low degree of complexity, as the "winning" model was to be applied by the European Commission and Eurostat to an actual database in order to look ahead into the near future and to mitigate the lack of tourism data reporting [14]. This study was designed for the European Commission in order to help it forecast tourism demand for the European Union as a whole. Thus, similar models were used to test the forecasting accuracy in each time period.
In doing so, we analyze whether the combined forecasting approaches according to the SAC and VACO methods, based on the outcomes of Autoregressive Integrated Moving Average (ARIMA) models, REGARIMA models with different trend variables, and Error Trend Seasonal (ETS) models (a state-space framework comprising traditional exponential smoothing models), outperform the single models, in general and over all periods, or whether specific individual models show superiority [4,[43][44][45]. Similar to the choice of the forecast combination techniques, the single forecasting models were also chosen based on the criterion of easy replicability by practitioners.
The database for the model estimations comprised the monthly arrivals and overnights statistics from Eurostat for the total of the EU-27 (unfortunately, no data were available for Ireland) from January 2005 until August 2017. In the estimations and forecasts, we distinguished between non-residents and residents (see also Figures 1 and 2). For calculating the forecasting accuracy, we performed out-of-sample forecasting for three and six months ahead. For the computations, we used EViews 9.5, and EViews 10 for the final quarterly report.

An Outline of the Models Used
For solving the forecasting problem, we used four different approaches: 1. Autoregressive Integrated Moving Average (ARIMA) models, 2. REGARIMA models with different trend variables, 3. Error Trend Seasonal (ETS) models, and 4. Combined forecasts (with Bates-Granger weights (VACO) and uniform weights (SAC)) based on the forecasts produced by the different single forecasting models mentioned above.

ARIMA and REGARIMA Models
A general ARIMA (p, d, q) model for a first and seasonally differenced forecast variable ∇∇_s y_t (i.e., d = 1) in period t (t = 1, . . . , T) reads as follows:

ϕ(L) ∇∇_s y_t = a + ϑ(L) e_t, (1)

where ϕ(L) and ϑ(L) in Equation (1) denote lag polynomials of finite orders p and q, while a denotes the intercept and e_t denotes the random error term. In this study, y_t corresponds either to overnights or to arrivals for the total EU-27. A general REGARIMA (p, d, q) model for a first and seasonally differenced forecast variable ∇∇_s y_t (i.e., d = 1) with a contemporaneous exogenous variable trend_t, in turn, reads as follows:

ϕ(L) ∇∇_s y_t = a + β trend_t + ϑ(L) e_t, (2)

where the notation in Equation (2) corresponds to that of Equation (1) and β denotes the coefficient of trend_t. In this study, trend_t corresponds either to a Hodrick-Prescott trend of the arrivals and the overnights or to various Google Trends indices.
It should be noted that the forecast variable is only first- or second-differenced in the REGARIMA models with Google Trends indices, but not seasonally differenced.

Employed ARIMA Models
For all variables used in the ARIMA models, we applied seasonal differencing to remove any deterministic or stochastic seasonal patterns. Because of the existence of non-seasonal unit roots, we additionally took first differences to achieve difference-stationary processes.
For the three- and six-month forecasting periods and to achieve the best model fit, we employed an ARIMA (2, 1, 0) approach to model the overnights of residents and non-residents. To model the arrivals of non-residents, we applied the ARIMA (2, 1, 1) model, while we chose the ARIMA (2, 1, 0) model for the residents. All the estimated coefficients were statistically significant. The estimated equations also had excellent results in terms of out-of-sample forecasting accuracy (see as an example the forecasting accuracies for arrivals and overnights for the study period December 2017 in Tables A1-A8 in Appendix A).

Employed REGARIMA Models with Hodrick-Prescott Trends
In order to manage the short-term forecasting problem, a quasi-causal model was constructed to explain arrivals and overnights. This model is based on a REGARIMA approach, which uses the flexible trend of the overnights/arrivals being explained by the model as its contemporaneous exogenous variable. The flexible trend was identified by the Hodrick-Prescott (HP) filter method and provides important exogenous aggregated information to the model [43,46]. Correcting the forecast based on the flexible trend with AR and MA errors optimized the results.
In doing so, the overnights/arrivals values were transformed into absolute previous year differences of the moving 12-month averages of the log-transformed original values. 12-month averages were used to adjust for seasonal fluctuations, calendar effects, and special events. The explanatory variables are absolute previous year differences of the flexible overnights/arrivals trends identified by the HP filter method [43,46]. As the HP filter is based on the moving 12-month average of the log-transformed original values, these variables are easy to extrapolate by using exponential smoothing methods for forecasting purposes [4].
For the three- and six-month forecasting periods, we employed REGARIMA approaches and corrected the processes with AR (1) errors for modeling non-residents' and residents' overnights. To model the arrivals of non-residents, we applied approaches corrected with MA (1) errors, while for the residents we corrected with AR (1) errors. All the estimated coefficients were statistically significant. The estimated equations also had excellent results in terms of out-of-sample forecasting accuracy.

Employed REGARIMA Models with Google Trends
Based on the REGARIMA models outlined in the previous section, we also used model variants with Google Trends indices instead of HP trend variables. In order to achieve stationarity, we took the first or second difference of the variables (no seasonal differencing was performed). First differences were taken of arrivals, while the Google Trends indices were used as they came. Second differences were taken of overnights, while the Google Trends indices were employed in first differences.
For the three- and six-month forecasting periods, we employed the aforementioned REGARIMA approaches. Various correction processes were necessary for both the arrivals and overnights data. To model the arrivals of the non-residents, we applied approaches corrected with AR (2) and MA (1) errors, while for the residents, we corrected with AR (2) errors. For modeling overnights, we applied AR (2) errors throughout.
Next, we explain what Google Trends are, where we can find them, and how we can use them. Google provides search data at an aggregated level (as an index) on its Google Trends page (http://trends.google.com/trends/), where users can identify the topics trending in search results or investigate a specific search term to learn about its popularity in different parts of the world. These data are open and free of charge to Google account holders, and can be downloaded in common spreadsheet formats to be used for analytical purposes, including forecasting.
In order to determine which web search terms were most useful, we collected and developed four types of Google Trends variables: (i) a Google Trends index with country names (EU_trends), (ii) a Google Trends index with country names and flights (EU_flights), (iii) a Google Trends index with country names and hotels (EU_hotels), and (iv) a Google Trends index with country names, flights, and hotels (EU_travel). There is no consensus among researchers about how to choose the keywords for analysis. One method is to choose the keywords directly by subjective assessment of a set of text or data [47]; in this study, we used keywords related to travel planning (i.e., flights, hotels) under the travel category of Google Trends.
To calculate the EU_trends variable, the name of each of the 27 European Union countries was used as the search term to retrieve the respective monthly search indices from the Google Trends website. The data were retrieved in monthly intervals between January 2016 and December 2017 and compiled in a single Excel file. Then, to generate a regional EU_trends variable, the average of the index values across the 27 countries was calculated for each month separately. The EU_flights and EU_hotels variables were also calculated in a similar way: the search terms being each country's name followed by "flights" or "hotels", respectively. As a last variable using Google Trends indices, we generated an EU_travel variable by summing all the search indices used for the previously calculated monthly "EU" indices, such that EU_trends + EU_hotels + EU_flights = EU_travel.
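The construction of the regional variables can be sketched with pandas. The country columns and index values below are hypothetical placeholders; in the study, the per-country indices were downloaded from the Google Trends website for all 27 countries.

```python
import pandas as pd

# Hypothetical per-country monthly Google Trends indices (real data
# would be retrieved from http://trends.google.com/trends/ for all
# 27 EU countries; only three countries and two months shown here).
country_idx = pd.DataFrame(
    {"France": [60, 72], "Spain": [55, 80], "Italy": [50, 64]},
    index=pd.to_datetime(["2016-01-01", "2016-02-01"]),
)
hotels_idx = country_idx * 0.5   # stand-in for "<country> hotels" searches
flights_idx = country_idx * 0.3  # stand-in for "<country> flights" searches

# Regional variables: average the country indices for each month...
eu_trends = country_idx.mean(axis=1)
eu_hotels = hotels_idx.mean(axis=1)
eu_flights = flights_idx.mean(axis=1)

# ...and sum the three regional indices into the combined variable,
# such that EU_trends + EU_hotels + EU_flights = EU_travel.
eu_travel = eu_trends + eu_hotels + eu_flights
```

Each of the four resulting series then enters the REGARIMA models as the exogenous variable in place of the HP trend.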
In most cases, the estimated models had statistically significant parameters and satisfactory forecasting accuracy results (for an example see Tables A1-A8 in Appendix A). One reason why insignificant variables were retained was so that the models could be tested using the same variables for the whole duration of the study period. A second reason is that, although some variables were statistically insignificant in specific periods (but not over the whole study period), this does not imply that these variables are unimportant for forecasting. In doing so, we follow a recent statement of the American Statistical Association (ASA) pointing out very clearly that the widespread use of statistical significance, understood as a 5% p-value threshold, as a justification for scientific findings leads to a biased perception of the scientific process [48]. The ASA statement has given the scientific community an impetus to move further toward a world beyond p < 0.05 and a signal to recognize that statistical inference is not equivalent to scientific inference [49,50].

Error Trend Seasonal (ETS) Models
The Error Trend Seasonal (ETS) model class was developed by Hyndman et al. [45,51] and encompasses various well-known exponential smoothing methods (e.g., single exponential smoothing, double exponential smoothing, additive and multiplicative seasonal Holt-Winters) within a theoretically founded state-space framework which is estimated recursively by employing maximum-likelihood methods. The ETS framework consists of a signal equation for the forecast variable and a number of state equations for the three components that cannot be directly observed: level, trend, and seasonal.
Since the ETS framework can automatically detect both trend and seasonal patterns and apply the most suitable model, further data transformation was not necessary.
Generally speaking, an ETS(·, ·, ·) model is represented by one of the following configurations of the error, trend, and seasonal components of the forecast variable, i.e., total EU-27 overnights or arrivals in this study [52]:

ETS(E, T, S) with E ∈ {A, M}, T ∈ {N, A, A_d, M, M_d}, and S ∈ {N, A, M}, (3)

where A in Equation (3) corresponds to additive, A_d to additive damped, M to multiplicative, M_d to multiplicative damped, and N to none. This makes a total of 30 possible ETS specifications. From these, the most suitable specifications were automatically selected by employing information criteria such as the Akaike information criterion (AIC) and the Schwarz or Bayesian information criterion (BIC). Information criteria such as AIC and BIC are means for model selection and offer a relative estimate of the information lost in terms of the log likelihood function when a given model, e.g., a particular ETS specification relative to all other ETS specifications, is used to represent the process that generates the data [44]. Here, AIC and BIC have been calculated for all 30 possible ETS specifications, and the specifications characterized by the minimum AIC and BIC values were then used for estimation and forecasting.
As an example of an ETS specification for three months ahead forecasting of overnights of non-residents in the EU-27 as selected by BIC, please see Table A5 in Appendix A. The one signal and two state equations of the selected ETS(M, N, A) specification read as follows [52]:

y_t = (l_{t−1} + s_{t−m})(1 + ε_t),  (4)
l_t = l_{t−1} + α(l_{t−1} + s_{t−m})ε_t,  (5)
s_t = s_{t−m} + γ(l_{t−1} + s_{t−m})ε_t,  (6)

where Equation (4) corresponds to the signal equation for the forecast variable y_t, Equation (5) to the state equation for the unobservable level component l_t, and Equation (6) to the state equation for the unobservable seasonal component s_t. α and γ denote the two smoothing constants, while the remaining notation in Equations (4) to (6) corresponds to that of Equations (1) and (2). It should be noted that the ETS(M, N, A) specification does not contain a state equation for the unobservable trend component, since the selected specification has no trend (T = N).
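To make the recursions concrete, the following is a minimal pure-Python sketch of an ETS(M, N, A) filter with a naive initialisation and fixed smoothing constants. In the study the constants are estimated by maximum likelihood and the specification is selected by AIC/BIC, so this is an illustrative sketch only; the function name and parameter defaults are invented for the example.

```python
# Illustrative sketch of the ETS(M, N, A) recursions (Equations (4)-(6)):
#   signal:   y_t = (l_{t-1} + s_{t-m}) * (1 + e_t)
#   level:    l_t = l_{t-1} + alpha * (l_{t-1} + s_{t-m}) * e_t
#   seasonal: s_t = s_{t-m} + gamma * (l_{t-1} + s_{t-m}) * e_t

def ets_mna_forecast(y, alpha=0.3, gamma=0.1, m=4, horizon=3):
    """Filter the series once and return `horizon` point forecasts.
    alpha and gamma are assumed fixed here; in practice they are estimated."""
    level = sum(y[:m]) / m                      # naive initial level: first-cycle mean
    season = [y[i] - level for i in range(m)]   # naive initial additive seasonals
    for t in range(m, len(y)):
        mu = level + season[t % m]              # one-step-ahead mean (l_{t-1} + s_{t-m})
        e = (y[t] - mu) / mu                    # relative (multiplicative) error
        level += alpha * mu * e                 # level update, Equation (5)
        season[t % m] += gamma * mu * e         # seasonal update, Equation (6)
    n = len(y)
    return [level + season[(n + h) % m] for h in range(horizon)]
```

For a perfectly periodic quarterly series such as [10, 20, 30, 40, 10, 20, ...], every one-step error is zero and the filter reproduces the pattern exactly.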

Forecast Combinations
In addition to the individual forecasting models, the merits of two forecast combination techniques are evaluated. Bates and Granger [5] indicate that combination forecasts can yield lower forecasting errors, a finding which was later confirmed by Clemen [6]. To this aim, the forecasts produced by the 64 models, separately for forecast horizons of three months and six months ahead, were combined (or averaged) based on two methods that are common in the literature: simple uniform weights (unweighted average, or SAC) and so-called Bates-Granger weights (weighted average, or VACO). More complex combination approaches are available, but they carry the disadvantage that additional errors introduced through parameter estimation might distort the results. To avoid these additional error sources, we used only combination methods based on calculated weights in this study.
More formally, a simple uniformly combined forecast F^U_h of overnights/arrivals for non-residents/residents for a forecast horizon h (h = 3, 6) is calculated as follows:

F^U_h = (1/M) Σ_{m=1}^{M} F^m_h,  (7)

where F^m_h in Equation (7) is the forecast value produced by one of the m (m = 1, . . . , M) competing forecasting models.
The Bates-Granger weight of an individual forecasting model is calculated as the inverse of the mean square error of that forecasting model relative to the sum of the inverses of the mean square errors of all forecasting models. Hence, a better individual forecasting model receives a relatively higher weight when calculating the average forecast.
More formally, a Bates-Granger [5] combined forecast F^BG_h of overnights/arrivals for non-residents/residents for a forecast horizon h (h = 3, 6) is calculated as follows:

F^BG_h = Σ_{m=1}^{M} [(1/MSE^m_h) / (Σ_{j=1}^{M} 1/MSE^j_h)] F^m_h,  (8)

where F^m_h in Equation (8) is the forecast value produced by one of the m (m = 1, . . . , M) competing forecasting models, and MSE^m_h denotes the corresponding mean square error.
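The two combination rules in Equations (7) and (8) can be sketched in a few lines of Python; the function and variable names are illustrative and not part of the study's own code.

```python
def combine_forecasts(forecasts, mses=None):
    """Combine M competing point forecasts for one horizon.
    With mses=None, the simple uniform average (Equation (7)) is used;
    otherwise Bates-Granger inverse-MSE weights (Equation (8)) are applied,
    so models with a better forecasting track record receive higher weight."""
    if mses is None:
        return sum(forecasts) / len(forecasts)
    inverse = [1.0 / mse for mse in mses]       # inverse mean square errors
    total = sum(inverse)
    weights = [w / total for w in inverse]      # weights sum to one
    return sum(w * f for w, f in zip(weights, forecasts))
```

For example, two forecasts of 10 and 20 with historical MSEs of 1 and 3 receive weights 0.75 and 0.25, yielding a combined forecast of 12.5.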

Results
The forecasting models were assessed based on the comparison of their ex-post out-of-sample forecasting accuracy in terms of the root mean square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). The averaged (or combined) forecasts were then treated the same way as the forecast values produced by the individual forecasting models, which means that the same error measures are also calculated for them.
To evaluate the forecasting accuracy of the different models, we ranked the values yielded by the various forecasting accuracy measures employed and summed the scores: this procedure allows for the interpretation that the forecasting model with the lowest total score delivers the best forecasting accuracy (see as an example the results for the December 2017 report in Tables A1-A8 in Appendix A). In the next step, we compared the total scores added over eight report periods (Tables 1 and 2). These tables present an evaluation by forecast horizon separated between non-residents and residents, as well as an overall rank. In addition, one goal of the commissioned study was to give an overall recommendation across forecasting accuracy measures, report periods, and forecasting horizons. Therefore, presenting an overall rank became necessary as well.
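The ranking procedure described above can be sketched as follows. The accuracy measures are the standard definitions of RMSE, MAE, and MAPE, but the function names and the tie-breaking (arbitrary order within ties) are assumptions of this example, not details taken from the study.

```python
def error_measures(actual, forecast):
    """Return (RMSE, MAE, MAPE in percent) for one model's forecasts."""
    n = len(actual)
    errors = [f - a for a, f in zip(actual, forecast)]
    rmse = (sum(e * e for e in errors) / n) ** 0.5
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / abs(a) for a, e in zip(actual, errors)) / n
    return rmse, mae, mape

def rank_totals(actual, model_forecasts):
    """Sum each model's rank under RMSE, MAE, and MAPE; the model with
    the lowest total score delivers the best forecasting accuracy."""
    scores = {name: error_measures(actual, f) for name, f in model_forecasts.items()}
    totals = {name: 0 for name in model_forecasts}
    for i in range(3):  # one pass per accuracy measure
        for rank, name in enumerate(sorted(scores, key=lambda k: scores[k][i]), 1):
            totals[name] += rank
    return totals
```

In the study, these total scores are additionally summed over the eight report periods before the overall rank is assigned.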
Comparing the forecasting performance between March 2015 and August 2017, which incorporates eight different forecasting situations as well as different macroeconomic environments, the combined forecasts with Bates-Granger weights consistently deliver the most accurate forecasts (see Tables 1 and 2). Generally speaking, this holds for arrivals, overnights, all forecast horizons, and both tourist types. In one case, however, the six months ahead forecasts of overnights by non-residents, the combined forecast based on Bates-Granger weights ranked second behind the ARIMA (2,1,0) model; the differences in the rank totals were so small, though, that both methods could be considered as sharing the first rank. In the competition for the top three ranks, the combined forecasts with Bates-Granger weights ranked first seven times and second once. The ARIMA models ranked first once, narrowly beating the weighted combined forecasts, second once, and third twice. The combined forecasts with uniform weights achieved the second rank five times and the third rank once. The REGARIMA models with Google Trends indices for the European Union scored one second rank and two third ranks. The ETS (AIC) models and the REGARIMA models with Google Trends indices for hotels and travel each achieved the third rank once.
When analyzing the overall ranks of the two employed forecast combination techniques separately for the eight report periods, combined forecasts based on Bates-Granger weights achieved the lowest sum of ranks in 22 out of 64 possible cases, whereas combined forecasts based on uniform weights did so in 16 cases (detailed results are available from the authors on request). Thus, in approximately 60% of all cases, and when considering only best ranks, averaging over the results of the individual forecasting models was clearly worthwhile. It should also be noted that none of the individual forecasting models performed extremely poorly and that the results were quite similar across report periods. For the given sample, it proved sufficient to limit the impact of weaker individual models by assigning them lower weights when calculating the Bates-Granger combined forecast. Consequently, more complex approaches (e.g., a formal screening procedure based on statistical criteria such as a forecast encompassing test) were not necessary (and would also have been at odds with the simplicity requirement for the methodology).

Conclusions
This study compared the accuracy of individual and combined approaches to forecasting tourism demand for the total European Union. The evaluation of the forecasting accuracies was performed recursively for eight periods spanning two years in order to check the stability of the outcomes during a changing macroeconomic environment. The analysis of the out-of-sample forecasts for arrivals and overnights showed that forecast combinations taking the historical forecasting performance of individual approaches such as Autoregressive Integrated Moving Average (ARIMA) models, REGARIMA models with different trend variables, and Error Trend Seasonal (ETS) models into account deliver the best results.
The analysis of the three months ahead and six months ahead out-of-sample forecasts for arrivals and overnights (non-residents, residents) showed that the VACO method was clearly the most accurate, followed by the ARIMA approach and the uniformly weighted combined forecasting method (SAC). The results and their stability over a two-year observation period demonstrate that taking the historical forecasting performance into account contributes to a significant improvement of forecasting accuracy and is recommended for practical application.
One particular advantage of the SAC and VACO methods in addition to their excellent performance is that they can be easily implemented by practitioners, which typically is not the case for more complex forecast combination techniques such as multiple encompassing tests. This simplicity also holds for the employed single forecasting models, which are typically part of any modern statistics and econometrics software. One further advantage is the ready availability of the different trend variables employed as exogenous variables in the REGARIMA models, such as HP trends and the various Google Trends indices.
A limitation of our study is that, for practical reasons, we used very simple models and did not test whether introducing complex models into the forecasting competition would have changed the rank orders in terms of forecasting accuracy. In line with the general need to improve forecasting accuracy, discussions of complexity have become increasingly relevant in the academic literature, although building complex approaches is at odds with scientific principles that advocate simplicity. On the other hand, one way to harness complex techniques to improve forecasting accuracy is to combine the forecasts of individual forecasting models, as employed in this study, since combined methods minimize the risk of high inaccuracy by "averaging out" the weaknesses of the single forecasting models. They are also capable of introducing adjustments and additional information, which can balance out measurement errors that would otherwise impair forecasting power.
Future research efforts could concentrate on developing forecasting models for the single European Union countries and testing if the forecasting accuracy performances of the different methods stay stable across countries or if significant differences can be detected. For practical reasons, such country-level approaches could even be more useful for governments and national tourism boards. Another research endeavor could investigate the sensitivity of the forecasting performances of the different methods in terms of different data frequencies (e.g., quarters instead of months).