Tourism Demand Forecasting Based on an LSTM Network and Its Variants

Abstract: The need for accurate tourism demand forecasting is widely recognized, yet the unreliability of traditional methods keeps it challenging. Using deep learning approaches, this study adapts Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM), and Gated Recurrent Unit (GRU) networks, which are straightforward and efficient, to improve Taiwan’s tourism demand forecasting. The networks are able to capture the dependence in visitor arrival time series data. The Adam optimization algorithm with an adaptive learning rate is used to optimize the basic setup of the models. The results show that the proposed models outperform previous studies undertaken during the Severe Acute Respiratory Syndrome (SARS) events of 2002–2003. This article also examines the effects of the current COVID-19 outbreak on tourist arrivals to Taiwan. The results show that the LSTM network and its variants can perform satisfactorily for tourism demand forecasting.


Introduction
Perishability is one of the most important characteristics of the tourism industry, making accurate tourism demand (TD) forecasting crucial [1]. Governments and organizations need to estimate the expected TD accurately in order to make sound policy-planning, tactical and operational decisions [2,3]. Accurate TD forecasting can effectively boost economic development and employment [4]; hence, the need for accurate TD forecasting is widely recognized [5]. Quantitative TD forecasting techniques can be divided into time series models, econometric approaches and artificial intelligence (AI) techniques [6]. However, no single technique outperforms the others in all scenarios in terms of accuracy.
Time series models have been very popular for TD forecasting, and their advantage lies in the validity and efficiency of autoregressive integrated moving average (ARIMA) and its variants [7]. Most ARIMA variants are subject to some limitations, such as the assumption of a linear relation between future and past time step values, and the number of observations [8]. Therefore, when solving complex nonlinear problems, the estimates obtained may be inaccurate.
Econometric methods can determine the cause-and-effect relation between TD dependent variables and independent variables [9]. However, most econometric models have several limitations. For instance, the independent variables are either exogenous or endogenous, and are decided in advance before the modelling process [10].
AI techniques, including machine learning and deep learning, are becoming increasingly popular in TD forecasting [11,12]. Among the AI techniques, artificial neural networks (ANN) provide a potential alternative for solving complex nonlinear problems. Numerous studies showed that ANNs generally outperformed other methods [3,5,13–16]. In general, AI technology can approximate arbitrarily complex nonlinear dynamic systems without any initial or extra information about the data, such as its distribution. This brings considerable benefits and simplifications to modelling, but on the other hand, AI techniques provide hardly any information about potential determinism or process understanding. However, since we are interested in TD patterns rather than physical processes, this property is not a major disadvantage.
With the advancement of ANNs, researchers have found that deep learning methods, especially recurrent neural network (RNN) architectures, are more suitable than feedforward neural networks for dealing with the complexity of time series [17]. However, RNN training suffers from vanishing gradients, so various variants of RNN models have been proposed, such as long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and gated recurrent unit (GRU). These networks differ from other methods in that they back-propagate through immediate historical data and current data, and are more suitable for detecting development trends. Their architecture overcomes the weakness of traditional RNNs in capturing long-term dependencies, a weakness shown by Bengio et al. [18]. With this feature, these networks have been widely used to solve time series forecasting problems [11,19–25].
According to the Annual Survey Report on Visitors Expenditure and Trends in Taiwan, the total revenue of international tourism increased from US$5936 million in 2008 to US$14,411 million in 2019. In recent years, international tourism has become a key service industry in Taiwan. Since 2008, Taiwan's international tourism revenue has exceeded domestic tourism revenue. If this momentum can be maintained, it will contribute to the development of the tourism industry and to economic growth in the future.
The outstanding results of LSTM networks and their variants in different fields show that they can not only capture the changing data trend, but also describe the dependence in time series data. Therefore, this research uses an LSTM network and its variants to predict Taiwan's TD.
The number of passengers has remained the most popular TD measure over the last decades. Since a time series model only requires historical observations of a single variable, the cost of data collection and model estimation is low. Hence, this study adapts an LSTM network and its variants to forecast Taiwan's TD. In order to validate the models, a data set including the severe acute respiratory syndrome (SARS) outbreak, which threatened tourism demand from November 2002 to June 2003, was used to compare the predictions with those of models reported in other papers. In view of the strong autoregressive pattern of the number of tourists [26], data from the SARS outbreak was used to train the network to predict the impact of the current COVID-19 epidemic on the number of tourists in Taiwan.
The remainder of this paper is organized as follows. Section 2 describes the LSTM, Bi-LSTM, and GRU networks. Section 3 describes the data. Section 4 presents the results and discussion for TD forecasting, and Section 5 provides our conclusions.

LSTM Network
In an RNN, the output can be fed back to the network as input, thereby creating a loop structure. RNNs are trained through backpropagation, in which the gradient is used to update the weights of the network. During backpropagation, an RNN encounters the vanishing gradient problem: the gradient shrinks as it propagates backwards in time. Layers that receive small gradients therefore barely learn, which leaves the network with only short-term memory.
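The shrinking of the gradient can be illustrated with a minimal sketch, assuming a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t). This toy recurrence, the weight 0.9 and the constant input 0.5 are illustrative assumptions and are not part of the models used in this study.

```python
import math

def run_rnn(w, inputs, h0=0.0):
    """Run the scalar RNN h_t = tanh(w * h_{t-1} + x_t) and return all states."""
    states = []
    h = h0
    for x in inputs:
        h = math.tanh(w * h + x)
        states.append(h)
    return states

def gradient_through_time(w, hidden_states):
    """Return |d h_T / d h_0| for the states produced by run_rnn.

    By the chain rule the gradient is a product of per-step factors
    w * (1 - h_t**2), one for each time step.
    """
    grad = 1.0
    for h in hidden_states:
        grad *= w * (1.0 - h * h)
    return abs(grad)

# Hypothetical weight and constant input, chosen only for illustration.
states = run_rnn(w=0.9, inputs=[0.5] * 50)
grad_5 = gradient_through_time(0.9, states[:5])   # gradient across 5 steps
grad_50 = gradient_through_time(0.9, states)      # gradient across 50 steps
print(grad_5, grad_50)
```

Because every per-step factor in this setting lies strictly between 0 and 1, the gradient across 50 steps is many orders of magnitude smaller than across 5 steps, which is the effect that gated architectures are designed to mitigate.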
The LSTM network was introduced by Hochreiter and Schmidhuber [27] to alleviate the problem of vanishing gradients. LSTMs use a mechanism called gates to learn long-term dependencies. These gates learn which information in the sequence is important to keep or discard. LSTMs have three gates: input, forget and output. Figure 1a shows the architecture of the LSTM cell [22,28,29]. The horizontal line between $C_{t-1}$ and $C_t$ is called the cell state. This is the core of the LSTM model, where pointwise addition and multiplication are performed to add or delete information from the memory. These operations are performed using the input and forget gates of the LSTM block, which also contains the output "tanh" activation function. The computations inside the LSTM neurons are as follows [27]:
Forget gate: $$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{1}$$
Input gate: $$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{2}$$
Output gate: $$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{3}$$
Process input: $$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{4}$$
Cell update: $$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \tag{5}$$
$$h_t = o_t \cdot \tanh(C_t) \tag{6}$$
where $\sigma$ refers to the sigmoid function, $[h_{t-1}, x_t]$ is the concatenation of $h_{t-1}$, the output of the previous cell, and $x_t$, the input of the current cell; $W_f$, $W_i$, $W_o$ and $b_f$, $b_i$, $b_o$ are the weight matrices and biases of the forget, input and output gates, respectively. $W_C$ and $b_C$ are the weights and bias of the cell state, and "·" means point-wise multiplication. $o_t$ determines which part of the cell state is exported, and $h_t$ is the final output.

Bi-LSTM Network
Figure 1b shows the general structure of the Bi-LSTM network. The input sequence is processed twice: once from left to right and once from right to left. This structure allows the model to learn the input sequence in both directions. The outputs of the forward and backward LSTM networks are combined to generate the prediction at the next time step. By using the time series and its reversed copy to make predictions, the Bi-LSTM provides supplementary context that helps the model learn faster and more effectively [30].
Hai et al. [21] surveyed different variants of LSTM (Vanilla, Stacked, Bi-directional), which were applied to the stock prices of 20 companies on the VN Index Stock Exchange during the five-year period from 2015 to 2020. The results show that the most accurate model is Bi-LSTM.
Figure 1. Structure of network according to Ko et al. [22], Graves [28] and Olah [29].
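The gate computations of the LSTM cell and the two-directional reading of the Bi-LSTM can be sketched in plain Python. This is a minimal illustration and not the implementation used in this study: the weights are random, the hidden size and the toy input sequence are hypothetical, and combining the two directions by concatenating their final hidden states is an assumption, since the way the outputs are combined is not specified here.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell update; p maps the block names f, i, o, c to (W, b) pairs."""
    z = h_prev + x_t                                     # concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(*p["f"], z)]         # forget gate
    i = [sigmoid(v) for v in affine(*p["i"], z)]         # input gate
    o = [sigmoid(v) for v in affine(*p["o"], z)]         # output gate
    c_hat = [math.tanh(v) for v in affine(*p["c"], z)]   # candidate cell state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_hat)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def init_params(n_hidden, n_input, rng):
    """Random weights and zero biases for the four blocks (illustrative only)."""
    def block():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + n_input)]
             for _ in range(n_hidden)]
        return W, [0.0] * n_hidden
    return {name: block() for name in "fioc"}

def run_lstm(sequence, p, n_hidden):
    """Feed the sequence through the cell and return the final hidden state."""
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, p)
    return h

def run_bilstm(sequence, p_fwd, p_bwd, n_hidden):
    """Concatenate the final states of a forward pass and a reversed pass."""
    forward = run_lstm(sequence, p_fwd, n_hidden)
    backward = run_lstm(sequence[::-1], p_bwd, n_hidden)
    return forward + backward

rng = random.Random(0)
n_hidden = 3                              # hypothetical hidden size
seq = [[0.1], [0.4], [0.3], [0.8]]        # toy normalized series, one feature per step
p_fwd = init_params(n_hidden, 1, rng)
p_bwd = init_params(n_hidden, 1, rng)
features = run_bilstm(seq, p_fwd, p_bwd, n_hidden)
print(len(features))
```

In a forecasting model, a vector such as `features` would typically be passed to a final dense layer that outputs the predicted arrival count for the next time step.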

GRU Network
The GRU network proposed by Cho et al. [31] is a modified LSTM model with two gates, so that each recurrent unit can adaptively capture dependencies on different time scales. Compared with the LSTM network, the GRU structure is less complicated, yet its effectiveness is not reduced, and it sometimes performs even slightly better than LSTM [32].
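One rough way to see the reduced complexity is to count trainable parameters. The sketch below assumes that each gate or candidate block has one weight matrix acting on the concatenation of the previous hidden state and the current input, plus one bias vector, so that an LSTM has four such blocks and a GRU three. The hidden size of 32 and the single input feature are hypothetical values.

```python
def block_params(n_hidden, n_input):
    """Parameters of one block: weights on [h_{t-1}, x_t] plus a bias vector."""
    return n_hidden * (n_hidden + n_input) + n_hidden

def lstm_params(n_hidden, n_input):
    # forget, input and output gates plus the candidate cell state
    return 4 * block_params(n_hidden, n_input)

def gru_params(n_hidden, n_input):
    # reset and update gates plus the candidate hidden state
    return 3 * block_params(n_hidden, n_input)

n_hidden, n_input = 32, 1   # hypothetical sizes for a univariate series
print(lstm_params(n_hidden, n_input), gru_params(n_hidden, n_input))
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size. Library implementations may add further bias terms, so the exact counts reported by a framework can differ slightly.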
The GRU removes the cell state and uses the hidden state to transmit information. Another distinction between GRU and LSTM is that the forget gate and input gate of the LSTM are combined into an update gate. Figure 1c shows the architecture of the GRU cell. The mathematical operations inside the GRU neurons are as follows [31]:
Reset gate: $$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \tag{7}$$
Update gate: $$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \tag{8}$$
Process input: $$\tilde{h}_t = \tanh(W_h [r_t \cdot h_{t-1}, x_t] + b_h) \tag{9}$$
Output: $$h_t = (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t \tag{10}$$
where brackets denote vector concatenation, "·" denotes point-wise multiplication, and $b_r$ is the bias vector of the reset gate. The reset gate acts like a gate in the LSTM, that is, the smaller $r_t$ is, the less information passes. $b_z$ and $b_h$ are the biases of the update gate and cell state, respectively. $W_r$, $W_z$ and $W_h$ are the weight matrices of the reset gate, update gate, and cell state, respectively.
Mean squared error was used as the loss function, and the Adam optimizer [33] was used to find the optimum weights for the networks. All the models were implemented using Keras in Python [34] with a TensorFlow backend [35]. During the training stage, all possible configurations of manually defined parameter subsets were tried in order to choose the parameters of the networks.
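The setup described above can be sketched with the Keras API. This is an illustrative outline under stated assumptions, not the configuration of this study: the single recurrent layer followed by a dense output, the sliding-window framing, and every value in the search grid are hypothetical, since the actual architecture and parameter subsets are not listed here.

```python
from itertools import product

def make_windows(series, window):
    """Turn a series into (inputs, targets) pairs for one-step-ahead forecasting."""
    inputs = [series[i:i + window] for i in range(len(series) - window)]
    targets = [series[i + window] for i in range(len(series) - window)]
    return inputs, targets

def build_model(kind, window, units, learning_rate):
    """Build an LSTM, Bi-LSTM or GRU regressor compiled with Adam and MSE loss."""
    # Imported lazily so the helpers in this sketch work without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    recurrent = {
        "lstm": lambda: layers.LSTM(units),
        "bilstm": lambda: layers.Bidirectional(layers.LSTM(units)),
        "gru": lambda: layers.GRU(units),
    }[kind]()
    model = keras.Sequential([
        keras.Input(shape=(window, 1)),   # window time steps, one feature
        recurrent,
        layers.Dense(1),                  # next-month arrivals (normalized)
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
    )
    return model

# Hypothetical search space; the values actually searched are not given in the text.
grid = {"window": [6, 12], "units": [16, 32], "learning_rate": [1e-3, 1e-2]}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))
```

An exhaustive search would then call `build_model` once per entry of `configs`, fit each model on windows reshaped to `(samples, window, 1)`, and keep the configuration with the lowest validation loss.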
The data was normalized to lie between 0 and 1. After the LSTM, Bi-LSTM and GRU models made their predictions, the predicted data was transformed back to the original scale. Equation (11) describes the function used in this study to normalize the dataset:
$$x'_t = \frac{x_t - x_{\min}}{x_{\max} - x_{\min}} \tag{11}$$
where $x_t$ is the input time series, $x'_t$ is the normalized time series, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the time series, respectively. In order to evaluate the forecasting performance of the models, the root mean squared error (RMSE) was used:
$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(y_m - y^{*}_m\right)^2} \tag{12}$$
where $y_m$ and $y^{*}_m$ are the observed and predicted values, respectively, and $M$ is the number of data samples.
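The normalization, its inverse, and the RMSE can be written as a short sketch. The arrival figures below are made-up values for illustration only, and the function names are hypothetical. Which subset of the data the minimum and maximum are taken from is not stated here, so the sketch passes them in explicitly.

```python
import math

def fit_min_max(series):
    """Return the minimum and maximum used for scaling."""
    return min(series), max(series)

def normalize(series, x_min, x_max):
    """Min-max scaling: map x_min to 0 and x_max to 1."""
    return [(x - x_min) / (x_max - x_min) for x in series]

def denormalize(scaled, x_min, x_max):
    """Invert the scaling to recover values on the original scale."""
    return [s * (x_max - x_min) + x_min for s in scaled]

def rmse(observed, predicted):
    """Root mean squared error over M paired samples."""
    m = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / m)

arrivals = [120_000, 150_000, 90_000, 180_000]   # made-up monthly arrivals
x_min, x_max = fit_min_max(arrivals)
scaled = normalize(arrivals, x_min, x_max)
restored = denormalize(scaled, x_min, x_max)
print(scaled, rmse(arrivals, restored))
```

Predictions are produced on the normalized scale, so `denormalize` has to be applied before the RMSE is computed if the error is to be reported in numbers of visitors.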

Data
The forecast target is the number of tourists visiting Taiwan each month. The data, obtained from the official website of the Taiwan Tourism Bureau, Ministry of Transportation and Communications (https://stat.taiwan.net.tw/inboundSearch (accessed on 17 July 2021)), extend from January 1984 to May 2021. This study uses two data series to verify the feasibility and effectiveness of the proposed forecasting models. Series 1 is split into a training dataset, covering the period from January 1984 to August 1998, and a testing dataset, covering the period from September 1998 to September 2005, as shown in Figure 2. The ratio of training to testing data is 70:30. Series 2 is divided into a training dataset, covering the period from January 1984 to March 2010, and a testing dataset, covering the period from April 2010 to May 2021 (see Table S1), as shown in Figure 3.
The testing period of series 1 covers the SARS outbreak, which had a great impact on Taiwan's TD [36]. The testing period of series 2 covers the outbreak of COVID-19. The significance of this research lies in the model's ability to predict time series that include catastrophic events, such as the SARS and COVID-19 outbreaks.
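The two chronological splits can be reproduced on the month index alone. The following sketch uses only the date boundaries given above; the helper names are hypothetical and no arrival data is involved.

```python
def month_range(start, end):
    """List (year, month) pairs from start to end, both inclusive."""
    y, m = start
    months = []
    while (y, m) <= end:
        months.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

def split_at(months, last_train_month):
    """Chronological split: everything up to last_train_month is training data."""
    cut = months.index(last_train_month) + 1
    return months[:cut], months[cut:]

series1 = month_range((1984, 1), (2005, 9))
series2 = month_range((1984, 1), (2021, 5))
train1, test1 = split_at(series1, (1998, 8))
train2, test2 = split_at(series2, (2010, 3))

for name, train, test in [("Series 1", train1, test1), ("Series 2", train2, test2)]:
    share = len(train) / (len(train) + len(test))
    print(f"{name}: {len(train)} training months, {len(test)} testing months, "
          f"training share {share:.2f}")
```

Series 1 has 176 training and 85 testing months, and series 2 has 315 training and 134 testing months, so both splits are close to the 70:30 ratio. Splitting by date rather than at random keeps every test observation later than all training observations, which matters for autoregressive forecasting.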

Series 1
The LSTM, Bi-LSTM, and GRU networks yielded RMSEs of 29,537, 30,264, and 30,531 for the testing dataset, respectively. The results are compared with those of previous fuzzy time series studies [37–39], as shown in Table 1. The RMSE of Huarng et al. [39] is smaller than those of Chen [37] and Huarng et al. [38]. For the SARS period, the RMSEs of Huarng et al. [39], LSTM, Bi-LSTM, and GRU are 61,863, 59,276, 59,480 and 59,369, respectively. These RMSEs are about twice the corresponding RMSEs of the entire period, indicating that TD forecasting during this period is difficult. However, the RMSEs of this study are about 4% lower than that of the aforementioned research, indicating that the models can be used to predict time series with catastrophic events. The comparison shows that the LSTM model is slightly better than the Bi-LSTM and GRU models in terms of RMSE. The RMSEs for the entire period, including the SARS period, are compared in Table 1.
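The percentages quoted above follow directly from the reported RMSEs. The short calculation below uses only the figures given in this section; the variable and function names are hypothetical.

```python
# RMSEs reported for the SARS period and for the entire testing period.
sars_rmse = {"Huarng et al. [39]": 61_863, "LSTM": 59_276, "Bi-LSTM": 59_480, "GRU": 59_369}
full_rmse = {"LSTM": 29_537, "Bi-LSTM": 30_264, "GRU": 30_531}

def relative_reduction(baseline, value):
    """Fraction by which value is lower than baseline."""
    return (baseline - value) / baseline

baseline = sars_rmse["Huarng et al. [39]"]
gains = {name: relative_reduction(baseline, sars_rmse[name]) for name in full_rmse}
ratios = {name: sars_rmse[name] / full_rmse[name] for name in full_rmse}

for name in full_rmse:
    print(f"{name}: {gains[name]:.1%} lower RMSE than [39], "
          f"SARS-period RMSE is {ratios[name]:.2f} times the full-period RMSE")
```

For all three networks the reduction relative to Huarng et al. [39] is close to 4%, and the SARS-period RMSE is roughly twice the RMSE of the entire testing period.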
The reported fuzzy time series models achieved successful prediction results within their modelling frameworks. However, these models have a potential weakness: they lack long-term dependency. Because their structure has no "memory" function, they are more sensitive to short-term relationships than to long-term dependencies and cannot capture some important recurring features. Furthermore, deep learning approaches are non-parametric and generalize more readily, as they require no fuzzification and de-fuzzification.

Series 2
The RMSEs for the testing dataset of the LSTM, Bi-LSTM and GRU networks are 100,410, 102,754 and 105,768, respectively, showing that they have similar performance. Even so, the LSTM model has slightly higher accuracy than Bi-LSTM and GRU. The training vs. validation loss and the actual data vs. predictions are depicted in Figures 5–10. The networks successfully identify the future trends and can emulate instances of extreme arrival dips.
SARS and COVID-19 are two catastrophic events that profoundly affected the world's TD [40]. Polyzos et al. [41] used data from the SARS outbreak to train an LSTM network, similar to the approach of Law et al. [11]. In the first training phase, the error is fed back to the network to calibrate the model. In addition, errors continue to be used in the gates of the network. Moreover, the LSTM network is insensitive to the lags between events in the time series. Therefore, when the form of the prediction model is unknown, the LSTM algorithm works better than other machine learning methods (such as hidden Markov models, support vector regression, etc.) or other prediction techniques (such as ARIMA) [11].

Conclusions
The purpose of this study is to adapt an LSTM network and its variants to improve Taiwan's TD forecasting. The results show that the proposed models are simpler and more effective than others for forecasting nonlinear data with shocks. These techniques appear adequate for other catastrophic situations that can affect the tourism industry.
To overcome the statistical complexities of analysing time series, this study empirically analyses the accuracy of LSTM, Bi-LSTM and GRU models applied to Taiwan's TD forecasting with shocks, namely the SARS epidemic and the COVID-19 pandemic. The deep learning forecasting models perform better than the three fuzzy time series models when considering the period of catastrophic events. The results show that the LSTM network and its variants can be applied to the arrival time series, given its strong autoregressive nature, using a network calibrated with training data from a similar past event, namely the SARS epidemic. From the global error perspective, the performance of the LSTM model is slightly better than those of the Bi-LSTM and GRU models in terms of the RMSE value.
Similar techniques have been applied to other destinations. Polyzos et al. [41] employed an LSTM network to forecast the effect of the current COVID-19 pandemic on the arrivals of Chinese tourists to the USA and Australia. Kulshrestha et al. [42] ascertained the validity of the Bayesian Bi-LSTM model using the TD data of Singapore. Therefore, the robustness of TD forecasting using an LSTM network and its variants is not country-specific.
However, the proposed models are unable to interpret TD from the economic perspective, and therefore provide little help in policy evaluation. Incorporating additional explanatory variables such as weather data and search engine data [11,43] is a promising way to increase the accuracy of TD forecasting. Therefore, establishing a comprehensive approach to variable selection should be the research direction going forward.