Forecasting Regional Tourism Demand in Morocco from Traditional and AI-Based Methods to Ensemble Modeling

: Tourism is one of the main sources of wealth for the Moroccan regions, since, in 2019, it contributed 7.1% to the total GDP. However, it is considered to be one of the sectors most vulnerable to exogenous shocks (political and social stability, currency change, natural disasters, pandemics, etc.). To control this, policymakers tend to use various techniques to forecast tourism demand for making crucial decisions. In this study, we aimed to forecast the number of tourist arrivals to the Marrakech-Saﬁ region using annual data for the period from 1999 to 2018 by using three conventional approaches (ARIMA, AR, and linear regression), and then we compared the results with three artiﬁcial intelligence-based techniques (SVR, XGBoost, and LSTM). Then, we developed hybrid models by combining both the conventional and AI-based models, using the technique of ensemble learning. The ﬁndings indicated that the hybrid models outperformed both conventional and AI-based techniques. It is clear from the results that using hybrid models can overcome the limitations of each method individually.


Introduction
Tourism is considered to be one of the main sources of wealth for the Moroccan economy, since, in 2019, it contributed 7.1% to the total GDP. From economic and social perspectives, the tourism sector enhances the development of local enterprises by creating consequential jobs in local areas, stimulating local demand, as well as being an important source of foreign currency inflow. The development of tourism also tends to have positive benefits which are highlighted through its contributions to the implementation and upgrading of the country's infrastructures (roads, tourist zones, hotels, hostel, etc.). Despite these advantages, tourism is one of the sectors most sensitive to exogenous shocks (political and social stability, currency change, natural disasters, pandemics, etc.). Various tools have been used by policymakers to sustain the growth of the tourism sector such as marketing, advertising, tourism events, etc. However, scholars tend to use more rigorous methods and multiple key indicators to capture tourism demand, such as tourism demand forecasting, which has become a key instrument for policymakers. Over the past decade, researchers have focused on using traditional techniques such as time series methods, econometric models, qualitative techniques, etc. Recently, researchers have used more rigorous methods such as artificial intelligence-based modeling. To date, only a limited number of researchers have tried to compare the performance of both conventional and AI-based techniques in terms of forecasting accuracy; moreover, none of them have tried to combine them into hybrid models and to assess the results.
The existing literature review reveals that most researchers have adopted the number of tourist arrivals as a proxy for tourism demand [1], for example, a study by [2] used the number of tourist arrivals in Beijing as compared with an internet search index to verify the unit, support vector regression, and artificial neural network) techniques to forecast tourist arrivals from 2010 to 2019 using monthly tourist arrival data; the findings suggested that the LSTM and GRU frameworks performed better than the others. [26] used an artificial neural network by examining its capability in the context of COVID- 19.
The summary of the literature relating to the proxy adopted to capture tourism demand has shown that the number of tourist arrivals was the most used variable. The review also discussed three main methods, i.e., econometrics, time series, and AI-based techniques used by researchers in the field of tourism demand forecasting.
The main purpose of this study is to forecast tourist arrivals using three different approaches. We start with first-level modeling which is the conventional method based on time series and econometric techniques, and the unconventional method which is drawn from the AI-based techniques capable of dealing with nonlinear behaviors. Finally, in second-level modeling, we combine both conventional and AI-based methods into hybrid models to overcome the limitations of the individual approaches.
The motivation behind this work comes from three perspectives. First, it is related to the context of the study area, where the tourism sector is considered to be one of the main sources of wealth for the Moroccan regions since it contributes to the total GDP, and therefore, forecasting is a crucial technique for policymakers. Second, this is the first research to shed new light on the use of hybrid models for tourism demand forecasting in Morocco at a regional level. Third, the present study fills a gap in the literature related to the debate between conventional methods such as econometric and time series methods and new type of models derived from artificial intelligence techniques with the possibility of hybrid models.
The remainder of this paper is organized as follows: In Section 2, we describe the source for the data and the preprocessing procedures, and then present the metrics used for measuring forecasting accuracy, as well as the theoretical background of all adopted approaches; in Section 3, we analyze the results and findings of the two approaches, and then the hybrid models; in Section 4, we provide a discussion; and finally, in Section 5, we summarize our conclusions and remarks.

Data and Preprocessing
To forecast tourism demand at a regional level, we used the number of tourist arrivals to the Moroccan regions from 1999 to 2018; the data were obtained from the Department of Economic Studies and Financial Forecast in The Minister of Economy and Finance. To validate the efficiency and the performance of the forecasting models, the data were split into two datasets, i.e., the estimating sample (training dataset) and the validation sample (testing dataset).
For more accurate results, the data were divided equally between training and testing (50% as testing data and the rest as training data), that is, data from 1999 to 2008 as training data and data from 2009 to 2018 as testing data. Sometimes data can raise serious issues related to an appropriate splitting scale. Since we had only 19 observations in total, we required the optimal choice for splitting the data. Choosing a small dataset for training could lead to overfitting the model, because the model could adapt excessively by learning all the possible hidden patterns in the data, and therefore, perform poorly in approximation. Conversely, a small testing dataset would likely generate roughly optimistic results.
The conventional and AI-based model approaches both have two different mechanisms of forecasting. It is well known that traditional techniques are limited to capturing only increasing/decreasing linear patterns of data. By using the period 1999-2008 as a training dataset, conventional techniques would succeed in learning the increasing trend (linear pattern) of data. However, using, for example, a random split or different split configuration, would leverage the AI-based model, which is built to handle complex patterns and penalize the traditional methods. In general, we tried to find a middle ground to keep the necessary statistical characteristics of each approach, especially, for the traditional approaches. Technically, we tried different split scenarios; the split selected was used for comparisons.
The data needed min-max scaling between a range of 0 to 1 using Equation (1); some machine learning and deep learning algorithms such as SVR and LSTM rely crucially on feature scaling and can be beneficial in improving forecasting accuracy. Feature scaling was done by using the following formula: where x min and x max are, respectively, the minimum and maximum values of the tourist arrivals to each region and x t is the number of tourists arrivals during year t. If x t is at the minimum value, the numerator will be 0, then, z t equals 0. Conversely, if x t is at the maximum value, the numerator is equal to the denominator, then, z t equals 1. Alternatively, if x t is between x min and x max , then, z t is in the range of 0-1.
Regarding the appropriate scaling method, the use of normalization versus standardization has been debated. Using gradient descent as an optimization framework (used by the SVR and LSTM) requires data scaling first. However, standardization scaling is used when dealing with data that contain unwanted extreme outliers; since standardization does not have a bounding interval, it smooths the data. In addition, it is necessary when dealing with features that converge to a normal or Gaussian distribution. Unlike neural network models, SVM, etc., standardization is a mandatory assumption for logistic regression, linear regression algorithms, etc.
Alternatively, normalization is used when the variable/process is behaving in a non-Gaussian distribution/or unknown distribution, which is an assumption that most algorithms do not require (e.g., LSTM, KNN, and SVR). Sometimes data capture shocks, which is a significant factor when the objective of a study is to analyze the impact of these outliers/shocks on the data. In addition, normalization can be used when we have multiple features with a different scale (multivariate analysis). The choice between normalization and standardization is based on each data, study, objective, type of analysis (univariate or multivariate), and type of variable (quantitative, binary, nominal, and categorical), which leads to different results. Taken together, the normalization scaling method tends to suit our case.

Metrics for Accuracy Measures
Traditionally, conventional models use statistical significance known as the p-value (mostly p < 0.05), but the pretesting process changes the distribution of the estimator parameters. To overcome this issue, researchers use predictive ability with multi-criteria measures, such as the root mean square error (RMSE). The RMSE evaluates the quality of the forecasting by showing the level of deviation between predictions and the actual values using Euclidean distance. The lowest RMSE means that more forecasted data are close to the real data. The RMSE is not robust for scale-invariance, meaning that RMSE is affected by min-max scaling; as a result, this measure is used over scaled datasets. The square is used to sum the percentage errors without attention to sign to compute the RMSE. For the mean absolute percentage error (MAPE), the error is determined as the actual data minus the forecasted data. Because this measure is a percentage, it is easy to understand, i.e., the lower the MAPE value is, the higher the accuracy of the forecast. The mean absolute error (MAE) examines the average of absolute error values of the prediction on all instances of the testing data. Table 1 shows the mathematical formula for each metric.
Scaling features should only be applied for traditional techniques. Since we are dealing with a univariate series analysis, scaling is not a necessity when applying traditional techniques. For example, AR or ARIMA methods, indeed, in some cases, can be crucial when the context is a multivariate time series analysis (ARIMAX and VAR). For example, the scale of the predictors has a significant magnitude than the dependent variable, in this case, non-scaling generates estimated coefficients that are large, which amplify the effect of predictions on the dependent variable when the model moves to the forecasting step. While scaling, in the case of small magnitude, will not harm the model.

Accuracy Metrics Formula
Where N is the number of data points (time span), y t is the actual values, andŷ t is the forecasted values.

Methodology
This study's primary contribution is to evaluate the performance of various machine learning approaches and compare them to conventional approaches in terms of forecasting accuracy, and then combine the two approaches into hybrid models to overcome the limitations of each approach. On the one hand, for example, ARIMA fails to predict complex patterns [21]; on the other hand, LSTM is a powerful neural network capable of mapping complex patterns. The idea behind a hybrid model is to capture the unique features of a data pattern by each approach, and then combine them into one blended model. For example, if a time series variable has two features, i.e., a linear and nonlinear behavior, using only LSTM captures only nonlinear behavior and the linear behavior is set as an error. Inversely, if we use only the ARIMA model, it captures only the linear part, setting the rest as an error, and if we combine the LSTM and ARIMA the two components linear and nonlinear (nonlinear here represents a model that uses nonlinear parameters) are captured. Researchers have found that hybrid methods that combine nonlinear algorithms and a linear process in time series variables have considerably more significant results [27]; these combined models can be used at the same time to capture both behaviors. The three artificial intelligence-based models used and the three conventional models are summarized below.

Support Vector Regression (SVR)
The SVR attempts to reduce error by finding the hyperplane and minimizing the difference between the forecasted and the observed data. The SVR was found to be more performant in prediction as compared with other approaches such as KNN and elastic net, due to enhanced optimization strategies for a wide range of factors.
The statistical theory behind this method was developed by [28]. Considering the training data as follows: where x t is the input vector at time t and y t is the related tourism demand for every x t . The prediction model is given by the following regression function f (x t ): where β is the bias and w is the weight vector. The goal is to obtain an f (x t ) with highly generalized behavior. For this purpose, the user needs to optimize the model complexity and the training error tolerance. The complexity of the model can be demonstrated by f (x t ) flatness, meaning having a small weight vector (w). This can be achieved by minimizing the Euclidean norm ||w||. To control training error tolerance, we can use the ε − insensitive loss function proposed by Vapnik in 1997. The ε − insensitive loss function ignores errors that are within ε range of the actual value by treating them as equal to zero. The measured distance between the actual value and the ε range can be described using the following expression: The loss function penalizes the model complexity by penalizing deviations greater than one, which means all training data larger than the ε − insensitive band. As seen graphically in Figure 1, when dealing with a one-dimensional linear regression function using the ε band, all values inside the ε band are equal to zero.
where is the bias and is the weight vector. The goal is to obtain an ( ) with highly generalized behavior. For this purpose, the user needs to optimize the model complexity and the training error tolerance. The complexity of the model can be demonstrated by ( ) flatness, meaning having a small weight vector (w). This can be achieved by minimizing the Euclidean norm || ||. To control training error tolerance, we can use the − loss function proposed by Vapnik in 1997. The − loss function ignores errors that are within range of the actual value by treating them as equal to zero. The measured distance between the actual value and the range can be described using the following expression: The loss function penalizes the model complexity by penalizing deviations greater than one, which means all training data larger than the − band. As seen graphically in Figure 1, when dealing with a one-dimensional linear regression function using the band, all values inside the band are equal to zero. The program that minimizes the error function is as follow: where * and are defined as Lagrangian multipliers and the training values positioned outside the − tube are referred to as the support vectors. This function can be solved using linear regression in feature space by introducing a kernel function k: Solving ( ) relies on using different types of kernels (such as Gaussian kernel, polynomial kernel, and linear kernel), see [28,29] for a full demonstration of the SVR. A The program that minimizes the error function is as follow: where α * m and α m are defined as Lagrangian multipliers and the training values positioned outside the ε − insensitive tube are referred to as the support vectors. This function can be solved using linear regression in feature space R z by introducing a kernel function k: Solving k x T m x relies on using different types of kernels (such as Gaussian kernel, polynomial kernel, and linear kernel), see [28,29] for a full demonstration of the SVR. A full description of the SVR method and its parameters related to tourism forecasting can be seen in [20,[30][31][32]].

Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is the second approach, initially proposed by [33], and is considered to be an advanced recurrent neural network (RNN); LSTM is capable of solving the vanishing/exploding gradient issue that RNN algorithms face. The RNN is un-able to recall or remember the long-term independence as a result of a vanishing/exploding gradient (similar to reading a book and not remembering the previous chapters). LSTM models are designed to avoid long-term dependency issues. They can be utilized for classifying and/or predicting using time series data. A typical LSTM architecture contains a memory cell (or the cell state), input gate, forget gate, and an output gate; Figure 2 illustrates the architecture of the LSTM model at a time t as in [25].
full description of the SVR method and its parameters related to tourism forecasting can be seen in [20,[30][31][32]].

Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is the second approach, initially proposed by [33], and is considered to be an advanced recurrent neural network (RNN); LSTM is capable of solving the vanishing/exploding gradient issue that RNN algorithms face. The RNN is unable to recall or remember the long-term independence as a result of a vanishing/exploding gradient (similar to reading a book and not remembering the previous chapters). LSTM models are designed to avoid long-term dependency issues. They can be utilized for classifying and/or predicting using time series data. A typical LSTM architecture contains a memory cell (or the cell state), input gate, forget gate, and an output gate; Figure 2 illustrates the architecture of the LSTM model at a time t as in [25]. The following formulas structure the LSTM model based on two equations: The equations of the gates: where is the input gate; is the forget gate; is the output gate; is the activation function (logistic sigmoid, tanh, or ReLu); , , and are the weight of gates; ℎ is the output of the previous gate at t − 1; is the input at time t; , , and are the biases of each gate.
The equations of the cell state and the output gates are: where is the memory (or the cell state) at time t and ̃ is a candidate for cell state at time t. The input gate ( ) determines which data should be added to the cell state. The forget gate ( ) controls whether data from the prior memory should be discarded or preserved. In the end, the output gate ( ) outputs the value. During the processing, it sends the previous hidden state to the next step of the sequence. The neural network's memory is stored in the hidden state. It stores information about prior data that the network has seen. The first step combines the input and the hidden state forming a vector . This has information about the current input and the previous inputs. The passes through the tanh activation function (tanh activation function squishes the data values so that they are always between −1 and 1, to assist in controlling the values that flow through the network); The following formulas structure the LSTM model based on two equations: The equations of the gates: where i t is the input gate; f t is the forget gate; o t is the output gate; σ is the activation function (logistic sigmoid, tanh, or ReLu); w i , w f , and w o are the weight of gates; h t−1 is the output of the previous gate at t − 1; x t is the input at time t; b i , b f , and b o are the biases of each gate.
The equations of the cell state and the output gates are: where c t is the memory (or the cell state) at time t and c t is a candidate for cell state at time t. The input gate (i t ) determines which data should be added to the cell state. The forget gate ( f t ) controls whether data from the prior memory should be discarded or preserved. In the end, the output gate (o t ) outputs the value. During the processing, it sends the previous hidden state to the next step of the sequence. The neural network's memory is stored in the hidden state. It stores information about prior data that the network has seen. The first step combines the input and the hidden state forming a vector x t . This x t has information about the current input and the previous inputs. The x t passes through the tanh activation function (tanh activation function squishes the data values so that they are always between −1 and 1, to assist in controlling the values that flow through the network); the result or the output is the memory of the network (or the new hidden state). In addition, the logistic sigmoid activation function can be used for squishing data between 0 and 1, and choosing between tanh or sigmoid is based on the data characteristics. According to researchers, tanh and sigmoid functions have been found to have some issues regarding vanishing/exploding gradient. Nowadays, rectified linear activation function (ReLu) activation is widely used and outperforms tanh and sigmoid functions. LSTM is described in detail by [33].

eXtreme Gradient Boosting (XGBoost)
The eXtreme Gradient Boosting (XGBoost) is a supervised learning algorithm proposed by [34]; it is based on the gradient boosting framework's concepts. It improves the forecasting performance by introducing more regularized model formalization to control overfitting problems using the classification and regression tree (CART).
The XGBoost process can be formalized as follows: whereŷ ı is the predicted output, m is the number of CARTs used to illustrate the model, f m (x i ) is the predicted output in the m-th tree, and F is the space of regression trees. By minimizing the following regularized objective function, it includes the loss function and term of regularization: where n is the number of observations, l is the second-order derivative loss function that measures the difference between the real data y i and the predictedŷ ı , Ω( f m) is the term of regularization, T is the number of leaves in the tree, w 2 j is the weight of the leaves, and complexity parameter γ, and β to control the tree. The goal of this optimization program is to define the structure of the CART and the weight of each tree, also we cannot optimize in the traditional Euclidean space. However, the program can be solved using an additive manner described in [34].
We choose three conventional techniques, namely ARIMA as the most used time series (TS) model in tourism demand forecasting literature, autoregressive (AR), and univariate linear regression.

Univariate Linear Regression
The linear regression mostly applied to analyze the relationship between a dependent variable Y t and a dependent variable x t known as simple linear regression or a set of dependent variables x t (i = 1, . . . , n is the number of independent variables) which is called multiple linear regression, can be constructed as follows: where β 0 is the constant, β i are regression coefficients to estimate, and ε t is the error term in time t. In the case of a simple linear regression, the above equation becomes: where Y t is the target or the output value (the dependent variable) and x t is the input value.
Since we want to use only one variable in the linear regression, Equation (12) becomes: In Equation (13), we have only one input feature (tourist arrivals), which is x t ; θ 0 and θ 1 are the regression coefficients; and the univariate linear regression is implemented in scikit-learn library in Python [35]. The purpose of choosing this technique is that it can only capture the linear pattern of the data and the simplicity of its statistical learning makes it much like conventional techniques.

Autoregressive (AR) Model
The autoregressive model is a time series model used to forecast a variable based on its lagged values. The AR model takes the following form: where Y t is the time series values at time t, p is the order of the lag, and Y t−i is the lagged values of Y t , with ε t as an error term in time t, and c as constant.

Autoregressive Integrated Moving Averages (ARIMA) Model
The autoregressive integrated moving averages (ARIMA) model was first proposed by Box and Jenkins in 1970, after Herman Wold introduced the ARMA model and failed to derive the likelihood function for maximum likelihood (ML) to estimate the parameters. In 1970, Box and Jenkins accomplished this finding, as outlined in the classic book Time Series Analysis. The ARIMA method has three components, using historical data through the autoregressive (AR) part and handling the stochastic factors by using the moving averages (MA) component.
The ARIMA model is expressed as ARIMA (p,d,q), where (p), (d), and (q), respectively, are the order of the AR part, the degree of differentiation, and the order of moving average part.
Mathematically, the ARIMA model can be written as follow: The Y t is the differenced variable (sometimes variable can be differenced multiple times). On the right side of Equation (15), there is the autoregressive (AR) component as the lagged values of Y t , and the moving averages component (MA) as the lagged values of the errors ε t . By using backshift notation, the ARIMA model should be as follows: where 1 − ∅ 1 B − · · · − ∅ p B p is the AR(p) component, (1 − B) d is the order of differentiation, and 1 − ϕ 1 B − · · · − ϕ q B q is the moving averages MA(q) part. The ARIMA model, in some cases, has been found to be more performant than the other complex time series models, such as VAR and VECM.
More recently, researchers have been manually fitting different parameters, as in [13][14][15]. The manual process goes from fitting different models using many orders, and then comparing the features of each model using information criteria: Akaike information criterion (AIC), Bayesian information criterion (BIC), Hannan-Quinn information criterion (HQIC), and corrected Akaike information criterion (CAIC). However, going through this process manually is time-consuming; especially for complex and large datasets, this process minimizes the risk of human error. Python's prebuilt libraries help to find the optimal parameters such as the order of integration, trend, stationarity, and seasonality in the case of seasonal data for AR/ARIMA models [36], by executing stepwise processing of hyperparameter tuning to identify the optimal parameters (such as p, d, and q for the ARIMA model). Finally, it returns a fitted AR/ARIMA model. The manual selection procedure depends on the autocorrelation function and partial autocorrelation function for ARIMA to determine, respectively, the number of MA terms and the number of AR terms; alternatively, the auto-selection process performs multiple differencing tests such as augmented Dicky-Fuller (ADF), Kwiatkowski-Phillips-Schmidt-Shin (KPSS), or Phillips-Perron (PP) automatically establish the order of differencing. The information criterion is optimized to a minimal value and to the highest for the log-likelihood; this process cycles through different integration orders to obtain the optimal set of parameters. The prebuilt libraries in Python support expending univariate time series, where the parameters can change over time. If the time series present an increasing trend during the observation period, then an increasing tuple should be specified, inversely, a decreasing tuple, when there is a declining trend during the observation time. The model responds automatically to shifting time series patterns and predicts values more accurately. The parameters used for each model are shown in Table A1.

Robust Forecasting Using Ensemble Learning
Robust forecasting can be done by merging two models or more, for example, combining an ARIMA model and LSTM results in a hybrid model. The technique of combining models, known as ensemble learning or ensemble modeling, consists of blending output predictions from different machine learning algorithms. The major advantage of using ensemble learning is that it can take many forms such as bagging (or bootstrap aggregating), boosting, adaptive boosting (AdaBoost), mixture of experts, stacked generalization, etc.
Since each method solely tends to capture different patterns and each method has its own biases and prediction errors, by combining methods, they can cancel each other's errors, leading to robust predictions. Hybrid models tend to have robust results [21] and are less overfitted. In this paper, we adopt the stacked generalization methodology, because of its flexibility, such as combining deep learning algorithms with machine learning models and artificial neural network models, and others; also setting multiple hyperparameters on the same hybrid model. The stacked generalization technique was first introduced by [37], this ensemble learning technique was used for minimizing the generalization error rate from one model or more than one model (known as first-level model or base models), by deducing the biases of errors concerning a given learning dataset. As shown in Figure 3, we combine first-level model predictions, by taking one model from conventional techniques and one model from AI-based models; hence, this blending sets out a second-level model (also called a meta-model or blending model) which uses the first-level predictions as training to determine how to blend and assign weights to the final predictions. The meta-model is constrained by using LASSO which stands for least absolute shrinkage and selection operator, developed by Tibshirani in 1996. This algorithm is based on a linear regression model that can let the meta-model learn from non-negative coefficients only, and can avoid the collinearity among the base models (first-level predictions).
Using the LASSO-learning function solves the question of which approach should have the highest weight; LASSO regression ensures that ensemble learning assigns a zero or a positive weight to each approach automatically [35]. The LASSO learns which model is the most accurate and assigns 1. Alternatively, a somewhat accurate model is assigned less than 1, and in some cases, if the first-level model appears to be inaccurate it assigns zero to each base model. Using the LASSO-learning function solves the question of which approach should have the highest weight; LASSO regression ensures that ensemble learning assigns a zero or a positive weight to each approach automatically [35]. The LASSO learns which model is the most accurate and assigns 1. Alternatively, a somewhat accurate model is assigned less than 1, and in some cases, if the first-level model appears to be inaccurate it assigns zero to each base model.

Results
The number of tourist arrivals to the Moroccan regions is forecasted only for the Marrakech-Safi region for the period 1999-2018, which is considered to be the most visited region in Morocco and can be considered to be a benchmark for the other regions. Forecasting was done using three conventional techniques (namely autoregressive model, ARIMA, and linear regression) and three AI-based models (namely LSTM, SVR, and XGBoost). Finally, combined modeling is based on the two approaches to overcome the limitations of each approach individually. The prediction accuracy of all models was assessed using three different measures: MAE, RMSE, and MAPE (see Table 2). Some of the models are implemented using Python from [38]. The parameters used for each model are shown in Table A1 in Appendix A.

Forecasting Results and Findings of the First-Level Models
By using the period 1999-2008 as a training set, the conventional techniques successfully capture an increasing trend observed in the real data (see Figure 4a), since the statistical mechanisms of the adopted traditional techniques learn from the historical data. For example, the number of tourist arrivals in 2008 depends on the number of tourist arrivals in 2007 which is captured by AR(1), leading to mapping an increasing number of tourist

Results
The number of tourist arrivals to the Moroccan regions is forecasted only for the Marrakech-Safi region for the period 1999-2018, which is considered to be the most visited region in Morocco and can be considered to be a benchmark for the other regions. Forecasting was done using three conventional techniques (namely autoregressive model, ARIMA, and linear regression) and three AI-based models (namely LSTM, SVR, and XGBoost). Finally, combined modeling is based on the two approaches to overcome the limitations of each approach individually. The prediction accuracy of all models was assessed using three different measures: MAE, RMSE, and MAPE (see Table 2). Some of the models are implemented using Python from [38]. The parameters used for each model are shown in Table A1 in Appendix A.

Forecasting Results and Findings of the First-Level Models
By using the period 1999-2008 as a training set, the conventional techniques successfully capture an increasing trend observed in the real data (see Figure 4a), since the statistical mechanisms of the adopted traditional techniques learn from the historical data. For example, the number of tourist arrivals in 2008 depends on the number of tourist arrivals in 2007 which is captured by AR(1), leading to mapping an increasing number of tourist arrivals. The evolution of the ARIMA and AR trends are close and have similar behaviors even if the two lines show some high forecasted values; however, the trend is the same as the real data. The linear regression shows significant results in terms of matching the real data graphically. Table 2  arrivals. The evolution of the ARIMA and AR trends are close and have similar behaviors even if the two lines show some high forecasted values; however, the trend is the same as the real data. The linear regression shows significant results in terms of matching the real data graphically. Table 2 provides the accuracy metrics of the three conventional methods; the linear regression model shows superior results with the lowest MAPE (7.19%), RMSE (185,612.9382), and the MAE (154,085.0944). Interestingly, it appears from Table 2 that the linear regression outperformed the ARIMA and AR techniques. Taken together with Figure 4a and Table 2, the results suggest that there is a linear pattern in the number of tourist arrivals captured by the three traditional techniques.

Results and Findings of the Second-Level Models
However, Figure 4b graphically compares the forecasting results for the Marrakech-Safi region of the three AI-based methods with the real data. The three models behave differently, as each method has its mechanism. The forecasted patterns may be explained by the fact that the algorithms try to capture the nonlinear process using their own set of learning and mapping capabilities. The unconventional methods exhibit a volatile behavior which signifies the ability to map the smooth nonlinearity process; nevertheless, previous knowledge about the regional tourism sector indicates that an increasing trend exists, while the SVR model illustrates an increasing-decreasing pattern which does not agree with the actual data.
From Table 2 we can examine the performance of the unconventional models. The LTSM model shows superior results with the lowest MAPE (6.27%), RMSE (154,514.3182), the MAE (130,324.425) values, followed by XGBoost, and then SVR. Finally, the results of the LSTM model were superior to the other models, with the minimum metrics criteria, Interestingly, it appears from Table 2 that the linear regression outperformed the ARIMA and AR techniques. Taken together with Figure 4a and Table 2, the results suggest that there is a linear pattern in the number of tourist arrivals captured by the three traditional techniques.

Results and Findings of the Second-Level Models
However, Figure 4b graphically compares the forecasting results for the Marrakech-Safi region of the three AI-based methods with the real data. The three models behave differently, as each method has its mechanism. The forecasted patterns may be explained by the fact that the algorithms try to capture the nonlinear process using their own set of learning and mapping capabilities. The unconventional methods exhibit a volatile behavior which signifies the ability to map the smooth nonlinearity process; nevertheless, previous knowledge about the regional tourism sector indicates that an increasing trend exists, while the SVR model illustrates an increasing-decreasing pattern which does not agree with the actual data.
From Table 2 we can examine the performance of the unconventional models. The LTSM model shows superior results with the lowest MAPE (6.27%), RMSE (154,514.3182), the MAE (130,324.425) values, followed by XGBoost, and then SVR. Finally, the results of the LSTM model were superior to the other models, with the minimum metrics criteria, indicating the smallest deviations between the forecasted value and the real data. Generally, according to the results of the first-level models, the LSTM technique outperformed all the AI-based models, as found by [25].
Another window to compare between both approaches, as shown in Table 3, is the total time required to implement each model. Conventional techniques are known for their statistical simplicity, and therefore require less time and computational capability.
However, the results may change when dealing with a high-dimensional dataset, where the standard techniques behave poorly and sometimes fail. Alternatively, the AI-based models require time for learning and mapping all the possible patterns existing in the data, using complex equations that take more computational sources. Here, only the traditional and the AI-based models are compared. Introducing the hybrid model is not wise, due to the fact that the time required to implement ensemble learning relies on the time it takes the base model to forecast, meaning we need to take into consideration the time required to implement base models. However, the results for LSTM_ARIMA were anticipated since this model combines ARIMA as the most used traditional technique [13,14,20] and LSTM an advanced neural network algorithm.
Finally, the hybrid models illustrated in Table 4 and Figure 5 outperformed the individual models, except for the XGBoost_ARIMA, this hybrid model shows an increase in the accuracy metrics, where the MAPE of the XGBoost model goes from 9.69% individually to 9.80% when combined with the ARIMA model (the same increase observed for the rest of accuracy measures). Consequently, the hybrid models sometime may fail to improve accuracy, due to inadequacies between the different mechanisms of each base model. Regarding second-level modeling Figure 6, the results for LSTM_AR indicated it was the most significant ensemble learning model, where adding a model capable of mapping linear pattern and another model capable of mapping nonlinear behavior improved the accuracy, as found by [20,27]. After combining both approaches using ensemble learning, the model learns to switch between the two patterns, i.e., to the linear pattern when the coefficient related to the traditional techniques is superior to the one for the unconventional model, and to the nonlinear pattern when the inverse happens, thus, leading to better mapping of the real data, and increasing the accuracy significantly. Regarding second-level modeling Figure 6, the results for LSTM_AR indicated it was the most significant ensemble learning model, where adding a model capable of mapping linear pattern and another model capable of mapping nonlinear behavior improved the accuracy, as found by [20,27]. After combining both approaches using ensemble learning, the model learns to switch between the two patterns, i.e., to the linear pattern when the coefficient related to the traditional techniques is superior to the one for the unconventional model, and to the nonlinear pattern when the inverse happens, thus, leading to better mapping of the real data, and increasing the accuracy significantly.  Regarding second-level modeling Figure 6, the results for LSTM_AR indicated it was the most significant ensemble learning model, where adding a model capable of mapping linear pattern and another model capable of mapping nonlinear behavior improved the accuracy, as found by [20,27]. After combining both approaches using ensemble learning, the model learns to switch between the two patterns, i.e., to the linear pattern when the coefficient related to the traditional techniques is superior to the one for the unconventional model, and to the nonlinear pattern when the inverse happens, thus, leading to better mapping of the real data, and increasing the accuracy significantly.

Discussion
According to our findings and those of previous studies, ensemble learning is promising since it takes advantage of every single model that is included, which leads to minimizing forecasting errors and outperforming individual standard models [39]. Among the previous works that used hybrid models, none of them found that conventional techniques outperformed the strategy of hybrid models [39,40], which is considered to be groundbreaking for the technique of ensemble learning. Hybrid models became a new trend in tourism forecasting between 2008 and 2017, as reported [39], whereas, prior to 2008, only four studies had examined the use of combining models in the tourism literature.

Discussion
According to our findings and those of previous studies, ensemble learning is promising since it takes advantage of every single model that is included, which leads to minimizing forecasting errors and outperforming individual standard models [39]. Among the previous works that used hybrid models, none of them found that conventional techniques outperformed the strategy of hybrid models [39,40], which is considered to be groundbreaking for the technique of ensemble learning. Hybrid models became a new trend in tourism forecasting between 2008 and 2017, as reported [39], whereas, prior to 2008, only four studies had examined the use of combining models in the tourism literature.
However, this technique requires more attention regarding which base model should we include, as some data can be affected by both the linear and nonlinear processes, which need a set of combined techniques capable of dealing with both behaviors [20,27].
Sometimes, including a time series technique in a hybrid model can be crucial, especially when a periodic pattern is present in a dataset, such as seasonality, trends, and cyclicity, where the traditional time series techniques have gained renowned respect [5,10,11]. However, time series combined, for example, with a neural network framework will significantly improve accuracy [20], or add features capable of mapping those repeated cycles despite having a short dataset [22]. Some researchers have used advanced techniques to increase the number of observations in the case of having an insufficient dataset, which have alleviated issues related to a lack of sufficient data. For example, using a rolling window [5], where the purpose of this strategy was to create "new" (the word new here adds many questions when dealing with traditional techniques such as time series) observations based on a previously observed sample. However, adopting those strategies in the case of a time series model (as a base model) will certainly change the statistical characteristics (trend, stationarity, etc.) of the created subsamples and the rolling windows, leading to inadequacy through the subsamples. Its like creating multiple variables with no primary processing steps (unit root test, autocorrelation, normality test, etc.) used in time series analysis inside the original variable. Since some subsamples are stationery and others are not, it is a matter of chance, which is unacceptable in statistical inference.
Introducing traditional techniques among the hybrid components may be challenging, especially in the preprocessing step of the data, as those methods need more accurate specifications to be implemented for forecasting, such as ensuring the stationarity, autocorrelation, increasing/decreasing trend, linearity, standard asymptotic, and the normal distribution, which are essential when dealing with time series assumptions. In addition, AI-based techniques also require attention when choosing the optimal parameters [41].
Which model should have the highest weight, is another question with respect to using hybridization techniques Some methods (LASSO, SWITCH, etc.) may have the answer. For example, the SWITCH algorithm tests the difference between hybrid models that include different components. The weight will be equal when the difference is statistically insignificant, otherwise, the best base model should be used individually if the difference is significant [39]. Adopting the LASSO function, as we did, ensures that ensemble learning will assign a zero or a positive weight to each approach automatically. The LASSO learns which model is the most accurate and assigns 1. Alternatively, a somewhat accurate model is assigned less than 1, and in some cases, if the first-level model appears to be inaccurate it assigns zero to penalize the error occurred by this base model.
None of the reviewed studies appeared to have discussed the total time required to implement each model, which is considered to be crucial when forecasting in real time as in [24] or dealing with high-dimensional data. According to our findings, conventional techniques tend to perform well in this competition, compelled by less processing complexity. In contrast, the AI-based models tend to take substantially more time due to their complexity when learning all the possible patterns, which requires more computational resources. The situation may be the inverse when dealing with high-dimensional data, where the conventional techniques failed to handle the task.
Tourism demand forecasting is a key instrument for policymakers. It has attracted the attention of researchers since the tourism industry plays a crucial role in the economic development of some regions such as the Marrakech-Safi region. Developing more accurate results helps governments, policymakers, investors, and tourism management to prepare the necessary infrastructure (roads, tourist zones, hotels, hostels, etc.) capable of serving the number of tourists by developing anticipated strategies. The pursuit for absolute accuracy in tourism demand forecasting has led researchers to use different approaches. Time series, econometric models, and AI-based techniques continue to attract scholars; however, a new type of model appears to be superior to all the previous approaches, taking advantage of all techniques in a complex hybrid model.
It is beyond the scope of this study to capture the impact of COVID-19 due to the restricted regional data availability after 2018, which limits the scope of analysis. Despite the success demonstrated by the three approaches, a significant limitation is the small sample size, which limits the conventional methods from finding a trend and expressive relationship. Using high-frequency data (monthly, seasonal, or weekly) could surely overcome this obstacle, and this issue should be anticipated and addressed in future research. Some studies have used non-official data such as using online reviews and internet big data from a search engine [2,23], or business sentiment surveys [24]; these new approaches of gathering data could open a window to limitations related to data availability.

Conclusions
In this study, we set out to forecast the number of tourist arrivals to the Marrakech-Safi region using three AI-based models versus three other conventional techniques, we evaluate their forecasting accuracy, and then we combine the two approaches in hybrid models to overcome the limitations of each technique. The findings showed that the AR model had the most significant results among the conventional techniques, while the LSTM outperformed all the AI-based models, which showed significant output values and successfully mapped the pattern of the real data. The second major finding was that the hybrid model that combined LSTM and AR models was consistently more accurate as compared with the other combined techniques and the base models.
Taken together, these results suggest that forecasting tourist arrivals cannot be done using a conventional or AI-based model solely. However, for results with high significance, researchers should combine different techniques to overcome the limitation of each technique individually, since each approach captures a unique feature of the data. The contribution of this study has been to show the differences between the techniques of conventional models (such as time series and econometric-based models) as compared with artificial intelligence-based models, as having two different methodologies, since they have different principles and mechanisms.
Further research could explore other possible hybrid models such as combining more than two models in one ensemble learning technique with different weighting algorithms Other studies could assess the difference between different techniques of ensemble learning when using mixed-frequency data which has been shown to be the new trend in tourism demand forecasting. Generally, more research is needed in explainable AI-based modeling.  Data Availability Statement: The tourist arrival dataset at the regional level and materials used in this paper are available at the following GitHub repository https://github.com/kol12303/Reseach_ Paper (accessed on 2 February 2022).

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A   Table A1. The parameters used for the models.