Impact of Uncertainty in the Input Variables and Model Parameters on Predictions of a Long Short Term Memory (LSTM) Based Sales Forecasting Model

A Long Short Term Memory (LSTM) based model has been developed to forecast the global sales of the hotel business of Travel Boutique Online Holidays (TBO Holidays). The LSTM model is multivariate; its input includes several independent variables in addition to a dependent variable, viz., sales from the previous step. One of the input variables, "number of active bookers per day", must be estimated for the same day as the sales. This need for estimation requires the development of another LSTM model to predict the number of active bookers per day, whose predicted output is then used as an input to the sales forecasting model. The use of a predicted variable as an input to another model increases the chance of uncertainty entering the system. This paper discusses the amount of variability observed in sales predictions under various uncertainties, or noise, arising from the estimation of the number of active bookers. For the purposes of this study, several noise distributions, including the normal, uniform, and logistic distributions, are used. Analyses of the predictions demonstrate that adding uncertainty via dropouts to the number of active bookers, as well as to the lagged sales variables, leads to model predictions that are close to the observations. The least squared error between observations and predictions is higher for uncertainties modeled using other distributions (without dropouts), with the worst predictions obtained for the Gumbel noise distribution. Gaussian noise added directly to the weights matrix yields the best results (minimum prediction errors). One explanation for this could be that the global minimum of the least squared objective function with respect to the model weight matrix is not reached, and therefore, the model parameters are not optimal. The two LSTM models used in series are also used to study the impact of the corona virus on global sales.
By introducing a new variable, called the corona virus impact variable, the LSTM models can predict corona-affected sales within five percent (5%) of the actuals. The research discussed in this paper finds LSTM models to be effective tools for the travel industry, as they are able to successfully model the trends in sales. These tools can also be reliably used to simulate various hypothetical scenarios.


LSTM Architecture
The central idea behind the LSTM architecture [23] is a memory cell, which can maintain its state over time, and nonlinear gating units, which regulate the information flow into and out of the cell. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. A schematic of a simple LSTM block can be seen in Figure 1. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.
Mach. Learn. Knowl. Extr. 2020, 2 FOR PEER REVIEW
The input gate determines the new data that get stored in the cell through a sigmoid layer followed by a tanh layer. The initial sigmoid layer, called the "input gate layer", identifies the values that will be modified. Next, a tanh layer creates a vector of new candidate values that could be added to the state.
The forget gate decides on the information that needs to be discarded from the cell state using a sigmoid layer that outputs a number between 0 and 1, where 1 means "completely keep this", and 0 implies "completely ignore this".
The output gate determines the information (yield) that goes out of each cell. The yielded value is based on the cell state along with the filtered and newly added data. Let x_t ∈ R^M be the input vector at time t, T be the number of LSTM blocks, and M be the number of inputs. Then, we get the following weights for an LSTM layer:
Input weights: W_xs, W_xi, W_xf, W_xo ∈ R^{T×M} for the activation, input, forget, and output gate, respectively.
Recurrent weights: W_hs, W_hi, W_hf, W_ho ∈ R^{T×T} for the activation, input, forget, and output gate, respectively.
Bias weights: b_s, b_i, b_f, b_o ∈ R^T for the activation, input, forget, and output gate, respectively.
Symbols of matrix products: ⊙ represents the elementwise (Hadamard) product; ⊗ represents the outer product; · represents the inner product.
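The weight shapes above imply that one LSTM layer carries 4(T·M + T² + T) trainable parameters: four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector. A minimal sketch, assuming for illustration a layer with M = 6 inputs (matching the six sales-model inputs listed later) and T = 100 blocks; these dimensions are illustrative:

```python
def lstm_param_count(M: int, T: int) -> int:
    """Trainable parameters of one LSTM layer with M inputs and T blocks.

    Four gates (activation, input, forget, output), each with:
      input weights     W_x* : T x M
      recurrent weights W_h* : T x T
      bias              b_*  : T
    """
    return 4 * (T * M + T * T + T)

# Illustrative first layer: M = 6 input features, T = 100 units.
print(lstm_param_count(6, 100))    # 4 * (600 + 10000 + 100) = 42800
# Second layer stacked on the first: M = 100 (previous layer's output), T = 100.
print(lstm_param_count(100, 100))  # 80400
```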
The gates are defined as:
Input activation: a_t = tanh(W_xs · x_t + W_hs · out_{t−1} + b_s); (1)
Input gate: i_t = σ(W_xi · x_t + W_hi · out_{t−1} + b_i); (2)
Forget gate: f_t = σ(W_xf · x_t + W_hf · out_{t−1} + b_f); (3)
Output gate: o_t = σ(W_xo · x_t + W_ho · out_{t−1} + b_o); (4)
Cell state: state_t = a_t ⊙ i_t + f_t ⊙ state_{t−1}; (5)
Block output: out_t = tanh(state_t) ⊙ o_t, (6)
where σ denotes the logistic sigmoid function.
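A minimal NumPy sketch of a single forward step through these gate equations; the dictionary layout and toy dimensions are illustrative, not from the paper:

```python
import numpy as np

def lstm_step(x_t, out_prev, state_prev, W, b):
    """One forward step of an LSTM block.

    W maps each gate name ('s', 'i', 'f', 'o') to a pair (W_x, W_h);
    b maps it to a bias vector.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    a_t = np.tanh(W["s"][0] @ x_t + W["s"][1] @ out_prev + b["s"])  # input activation
    i_t = sigmoid(W["i"][0] @ x_t + W["i"][1] @ out_prev + b["i"])  # input gate
    f_t = sigmoid(W["f"][0] @ x_t + W["f"][1] @ out_prev + b["f"])  # forget gate
    o_t = sigmoid(W["o"][0] @ x_t + W["o"][1] @ out_prev + b["o"])  # output gate
    state_t = a_t * i_t + f_t * state_prev                          # cell state update
    out_t = np.tanh(state_t) * o_t                                  # block output
    return out_t, state_t

# Toy dimensions: M = 3 inputs, T = 4 blocks.
rng = np.random.default_rng(0)
W = {g: (rng.standard_normal((4, 3)), rng.standard_normal((4, 4))) for g in "sifo"}
b = {g: np.zeros(4) for g in "sifo"}
out, state = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, b)
print(out.shape, state.shape)  # (4,) (4,)
```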

Backpropagation:
The deltas inside the LSTM block are calculated by backpropagating the prediction error through time, and the corresponding weight updates take the form of outer products between the gate deltas and the inputs, e.g., dW_xo = Σ_t δo_t ⊗ x_t for the input weights of the output gate.
The forecasting system has two underlying LSTM models, as shown in Figure 2. The first model predicts the number of active bookers on a given day; the prediction is based on the daily lag of active bookers, the yearly lag of active bookers, the day of the week, and the month of the year. The output of the bookers model was used as an input in the second LSTM model, which forecasted the global sales. The sales forecasting model had inputs such as sales from the previous step, the yearly lag of sales (which accounts for seasonality), the day of the week, the month of the year, the active booker count, and sales per active booker. Each sequence of the dataset fed into an LSTM cell consisted of the previous seven (7) days of data. Both models made predictions one time step [24] at a time, and these predictions were used as inputs to make predictions at the next time step. Both models (active booker count and sales forecast) have two LSTM layers of 100 neurons each, followed by an output dense layer with one neuron. To model uncertainty in a neural network model, there are several possible approaches, e.g., Monte Carlo [25] simulation, Bayesian Neural Networks [26], and the use of dropouts in between the LSTM layers.
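The seven-day windowing and recursive one-step-ahead loop described above can be sketched as follows; `predict_one` stands in for the trained two-layer Keras LSTM and is stubbed here by a trivial last-value "model", so the names and toy data are illustrative only:

```python
import numpy as np

def make_sequences(series: np.ndarray, window: int = 7):
    """Build (samples, window) inputs and next-step targets from a daily series."""
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

def forecast_recursive(predict_one, history: np.ndarray, steps: int, window: int = 7):
    """Roll the model forward: each prediction is fed back as an input
    for the next time step, as both models in the paper do."""
    buf = list(history[-window:])
    preds = []
    for _ in range(steps):
        preds.append(predict_one(np.array(buf[-window:])))
        buf.append(preds[-1])
    return np.array(preds)

# Toy check with a stub 'model' that just repeats the last observed value.
X, y = make_sequences(np.arange(10.0))
print(X.shape, y.shape)  # (3, 7) (3,)
preds = forecast_recursive(lambda w: w[-1], np.arange(10.0), steps=3)
print(preds)  # [9. 9. 9.]
```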
A study conducted by Yarin Gal and Zoubin Ghahramani [16] showed how uncertainty can be modeled with dropouts in neural networks to improve log-likelihood and RMSE performance compared to existing state-of-the-art methods. In deep neural networks, dropout is a technique used to avoid overfitting. Figure 3 shows a high-level pictorial representation of the three components of the model where uncertainties could lie. For the purposes of this study, predictions were made and compared with actual observations using the above models (Figure 2) and the following approaches (cases).
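The Gal and Ghahramani idea, keeping dropout active at prediction time and summarizing many stochastic forward passes, can be illustrated with a toy NumPy network; the linear "network", dropout rate, and weights below are illustrative stand-ins, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_dropout_predict(forward, x, rate=0.1, n_samples=200):
    """Monte Carlo dropout: run many stochastic forward passes with dropout
    still active, then report the sample mean (prediction) and sample
    standard deviation (uncertainty estimate)."""
    samples = np.array(
        [forward(x, rng.random(x.shape) >= rate) for _ in range(n_samples)]
    )
    return samples.mean(axis=0), samples.std(axis=0)

# Toy 'network': one linear layer whose inputs pass through inverted dropout
# (dropped units are zeroed; survivors are rescaled by 1/(1 - rate) = 1/0.9).
w = np.array([0.5, -0.2, 0.8])
forward = lambda x, keep: (x * keep / 0.9) @ w
mean, std = mc_dropout_predict(forward, np.array([1.0, 2.0, 3.0]))
print(mean, std)  # mean near the deterministic output 2.5, std > 0
```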
Case 1: Deterministic approach: In this approach, dropouts are not used at the time of prediction.
Case 2: Stochastic dropout approach: In this approach, dropouts [27] are used at both the training and prediction stages. Three combinations of models are run for the stochastic dropout approach, viz.:
a. The dropouts are only used in the active booker count model and not in the sales model at the time of prediction;
b. The dropouts are only used in the sales model and not in the active booker count model at the time of prediction;
c. The dropouts are used in both the sales and the active booker count models at the time of prediction.
A recurrent dropout with a dropout rate of 20% and a kernel dropout with a dropout rate of 10% were used in the LSTM layers. Figure 4 shows a schematic representation of dropouts in neural network layers.
Case 3: Stochastic noise in the predicted active booker count and sales: In this approach, instead of using dropouts at the time of prediction, various noise distributions are used to add uncertainty to the models. Uncertainty can exist in both the active booker count and sales forecasting models. Gaussian, uniform, triangular [28], logistic [29], and Gumbel [30] distributions are used for the noise inputs. While adding noise to the models, the standard deviation is kept the same as that observed in the stochastic dropout models, with a mean of 0. The Gaussian, uniform, and triangular noises are symmetric distributions around a 0 mean. The logistic distribution has heavier tails, while the Gumbel distribution is skewed, with a nonzero positive mean, and is used to model extreme values. Other distributions, such as the log-normal [31] and exponential [32] distributions, were also considered but not used because they only add positive noise to the model. The three combinations of models described in the stochastic dropout approach (Case 2) were then run for each of the above five (5) noise distributions.
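A sketch of how the five noise distributions can be sampled with a common target standard deviation (here 0.2, an illustrative value); each scale follows from the distribution's variance formula, and the Gumbel draw keeps its location at 0, so its mean stays positive (Euler-gamma times the scale), as noted above:

```python
import numpy as np

rng = np.random.default_rng(7)

def noise(dist: str, std: float, size: int) -> np.ndarray:
    """Sample zero-location noise with a target standard deviation."""
    if dist == "gaussian":
        return rng.normal(0.0, std, size)
    if dist == "uniform":                     # var = a^2 / 3 on (-a, a)
        a = std * np.sqrt(3.0)
        return rng.uniform(-a, a, size)
    if dist == "triangular":                  # var = a^2 / 6 for (-a, 0, a)
        a = std * np.sqrt(6.0)
        return rng.triangular(-a, 0.0, a, size)
    if dist == "logistic":                    # var = (pi * s)^2 / 3
        s = std * np.sqrt(3.0) / np.pi
        return rng.logistic(0.0, s, size)
    if dist == "gumbel":                      # var = (pi * beta)^2 / 6
        beta = std * np.sqrt(6.0) / np.pi
        return rng.gumbel(0.0, beta, size)
    raise ValueError(dist)

for d in ("gaussian", "uniform", "triangular", "logistic", "gumbel"):
    print(d, round(noise(d, 0.2, 100_000).std(), 3))  # each close to 0.2
```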
Case 4: Stochastic noise on weights: In this approach [33][34][35], Gaussian noise is added to the model weights (model constants). As described in the previous section, there are two LSTM layers in each model. The Gaussian noise is added in two ways, viz. with an absolute standard deviation (0.1 and 0.2) and as a percentage deviation (10% and 20%). The three combinations of models described in the stochastic dropout approach (Case 2) were then run for each of the two cases.
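A hedged sketch of the two ways of perturbing a set of weight matrices; the per-weight "relative" reading of the percentage deviation is an assumption, and the matrices below are toy data rather than trained model weights:

```python
import numpy as np

rng = np.random.default_rng(3)

def perturb_weights(weights, mode="absolute", level=0.2):
    """Add Gaussian noise to a list of weight arrays (Case 4).

    mode="absolute": noise ~ N(0, level) is added to every weight;
    mode="relative": each weight's noise std is level * |w|
                     (e.g. level=0.2 for a 20% deviation).
    """
    noisy = []
    for w in weights:
        if mode == "absolute":
            noisy.append(w + rng.normal(0.0, level, w.shape))
        else:
            noisy.append(w + rng.normal(0.0, 1.0, w.shape) * level * np.abs(w))
    return noisy

weights = [np.ones((2, 3)), np.full((3,), 2.0)]
for mode in ("absolute", "relative"):
    print(mode, [nw.shape for nw in perturb_weights(weights, mode)])
```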
Historical daily global hotel sales data from 1 January 2017 to 14 January 2020 were used for training the models. The forecasting models were trained on an 8-CPU Ubuntu Linux server with 32 gigabytes of memory. The percentage error between predicted and actual sales for the month of January 2020 was used as the error metric for comparing the performance of the various models; this is referred to as the observed error in Table 1.
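The observed-error metric can be written as a one-line function; the sign convention below (positive for overprediction) is an assumption, as the paper reports only magnitudes:

```python
def percentage_error(actual: float, predicted: float) -> float:
    """Percent deviation of predicted monthly sales from actual monthly sales
    (the 'observed error' of Table 1; sign convention assumed)."""
    return 100.0 * (predicted - actual) / actual

print(percentage_error(100.0, 95.0))   # -5.0  (5% underprediction)
print(percentage_error(200.0, 220.0))  # 10.0  (10% overprediction)
```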

Tests and Results
The models were evaluated on the total sales predicted for the month of January 2020 with the starting prediction date of 15 January 2020. Results of various model runs for conditions explained in the previous section are summarized in Table 1.
A quick look at the data shows that the best predictions are obtained for the stochastic model with noise in the weights (model constants) and the worst for the case where noise is embedded in the input dataset. Within the latter case, the worst predictions were observed for the Gumbel noise distribution, which models a Generalized Extreme Value distribution. The predictions suggest that noise in the input dataset is not related to extremes and that no extreme (extraordinarily high or low) sales will occur. The sales predictions of the models with noise in the input variables (with the exception of the Gumbel distribution) are very similar to those of the deterministic model. This can be confirmed by analyzing the p-value of a two-tailed t-test [36]. These results suggest that the current dataset does not have much variation in the input values and that the active booker count and sales are close to deterministic. In other words, there is very little uncertainty in the input dataset, and perturbing the values does not alter the results (the output sales forecast) significantly. Another possibility is that the noise in the input dataset does not follow any of the mathematical distributions used.
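The two-tailed t-test comparison can be sketched with SciPy; the synthetic daily series below are illustrative stand-ins for the deterministic and noise-perturbed prediction runs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
deterministic = rng.normal(1.00, 0.05, 31)                # synthetic daily predictions
with_noise = deterministic + rng.normal(0.0, 0.005, 31)   # small input-noise perturbation

# Welch's two-tailed t-test: a large p-value means the perturbed predictions
# are statistically indistinguishable from the deterministic ones.
t_stat, p_value = stats.ttest_ind(deterministic, with_noise, equal_var=False)
print(p_value > 0.05)  # True: no significant difference at the 5% level
```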
While the noise in input variables does not yield better predictions when compared to the deterministic model, randomly dropping the hidden units (neurons or cells) at each update during training using the dropout functionality of the LSTM model seems to improve predictions. The best predictions were observed when dropouts were applied to simulate uncertainty in both the active booker count and sales. This suggests that variability in the actual dataset is reduced by filtering out extreme values leading to better predictions. This contrasts with the case where every data point (and neuron) is included in training the LSTM models but with implicit uncertainty as demonstrated above.
The next step is to analyze the uncertainty in the model weights (model constants), coupled with dropouts in the neural network layers, and its impact on sales predictions. The results summarized in Table 1 show that when noise was added to the model weights, either as an absolute value at 0.1 and 0.2 or as a 10% and 20% deviation from the mean, the predictions were closest to the actual sales. The best results were observed when noise with a standard deviation of 0.2 was added to the weights of both the active booker count and sales LSTM models. The p-values also indicate that these predictions were significantly different from those of the deterministic model. Uncertainty in the model weights suggests that model convergence during training could be improved further, or that the number of neurons and LSTM layers was not adequate. It is also possible that the shallowness of the LSTM model, in terms of fewer neurons and LSTM layers, made the model weights less deterministic.
For confidentiality reasons, sales numbers in this paper were scaled between 0 and 1; however, the variations in the actual and predicted sales numbers were preserved. The charts in Figure 5 show a comparison of predicted vs. actual sales on a daily level for the month of January 2020 for various versions of deterministic and stochastic models. Charts in Appendices A and B show the results of all the remaining possible combinations (active booker count only, sale only, and active booker count and sales) for the four cases given above.
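The confidentiality scaling can be done with a simple min-max transform that preserves relative variation; a sketch with toy sales figures (the numbers are illustrative, not TBO data):

```python
import numpy as np

def scale01(x: np.ndarray):
    """Scale a sales series to [0, 1] while preserving relative variation."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), (lo, hi)

def unscale(y: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Invert the scaling to recover the original units."""
    return y * (hi - lo) + lo

sales = np.array([120.0, 150.0, 90.0, 210.0])
scaled, (lo, hi) = scale01(sales)
print(scaled)                  # [0.25 0.5  0.   1.  ]
print(unscale(scaled, lo, hi)) # original series recovered
```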
Towards the end of the month (25 January onwards), we observed a deviation between the predicted and actual sales. This happened because of the outbreak of the corona virus, which had an impact on sales. The model overpredicted the sales because of its long-term memory; it needed more data to build the short-term memory required to capture the drastic impact of the virus on sales.
While it is worthwhile to understand the impact of uncertainty and noise in data on predictions, it would be interesting to extend the study to analyze the impact of corona virus spread on sales. One can determine the loss in sales by letting the model predict sales in the corona-free environment and then compare it to actual sales. Several such "what-if" simulations can be conducted using the models developed.

Impact of Corona Virus Outbreak on February 2020 and March 2020 Sales
Models with the best predictions, as determined in the previous section, were used to predict the sales for the months of February and March 2020. In other words, the following models were used:
1. Stochastic dropout model with uncertainty in both the active booker count and sales;
2. Noise in weights with 0.2 standard deviation in the active booker count and sales models;
3. Noise in weights with 20% standard deviation in the active booker count and sales models.
For this study, a time period during which the corona virus outbreak severely impacted global sales was chosen. The predictions were made on 15 February 2020, and the forecasts were then compared with the actual sales to assess the impact of the corona virus outbreak. Figure 6 shows that the impact of the corona outbreak on sales was mild at the beginning of February; the impact became severe only in the last week of February. Table 2 summarizes the predictions for the loss in sales made by the three models discussed above. Figure 7 shows that the LSTM model can predict the impact of the corona virus on sales by adding a new binary input variable, called the corona virus impact variable, which determines whether or not the sales are impacted by the virus spread. The model was able to predict sales quite accurately when the new variable was added; sales were predicted to be substantially higher when the variable was not included in the model. This study shows the power of LSTM based models to conduct "what-if" studies that would otherwise be impossible to carry out in a real and practical environment. The same model can be used to understand how sales would return to normal levels once the menace of the virus has been conquered. The LSTM model can also be integrated with models predicting when the impact of the corona virus would end.
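Constructing such a binary indicator is straightforward; a pandas sketch, where the cutoff date and column names are illustrative rather than taken from the paper:

```python
import pandas as pd

# Hypothetical feature construction: flag the days on which sales are assumed
# to be affected by the virus outbreak (cutoff date is illustrative).
df = pd.DataFrame({"date": pd.date_range("2020-01-20", periods=6, freq="D")})
df["corona_impact"] = (df["date"] >= "2020-01-25").astype(int)
print(df["corona_impact"].tolist())  # [0, 0, 0, 0, 0, 1]
```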
Furthermore, the LSTM model can be used to perform sensitivity analyses to fathom the differential change in sales with respect to a differential change in the number of key account managers. This would allow the judicious hiring of key account managers. Similarly, we can study how much sales would increase for every percentage increase in the number of agencies (clients). Adding new variables to the model allows for the simulation of more scenarios. The possibilities are endless, and the use of complex and accurate machine learning techniques lends more credibility to the analyses.

Conclusions and Future Work
LSTM modeling is an effective technique that can be used in the travel industry, as it is able to successfully model the nonlinear trends and variations in sales over time. Multiple ways of modeling uncertainties in an LSTM model are presented. Uncertainties can be modeled using dropouts, as noise added to the input variables, and as Gaussian noise added to the model weights. We observed that the prediction accuracy of an LSTM model can be improved by using dropouts and, even more effectively, by adding noise to the constants of the model. Uncertainty in the model weights has the biggest impact on the model predictions, suggesting that the reduced depth (number of layers) of the LSTM model can be compensated for by adding noise to the model parameters. Perhaps a model with more neurons and LSTM layers would lead to more accurate deterministic predictions; that would, however, require more data. The impact of the corona virus on the hotel business could be quantified, as the models have the flexibility to include or drop input variables, making LSTMs all the more desirable. While the sales forecasts were made at a global level, the same can be performed at the source market (country) level. Uncertainty in country-specific models can be researched, and a study can be conducted to see how these uncertainties correlate with the uncertainty at a global level. Models can be developed for other lines of business, such as airlines, car rentals, and sightseeing, to name a few. The nature of the uncertainties can then be compared across product lines. Owing to their credibility in generating accurate predictions, LSTM models can be used to study various hypothetical scenarios, the results of which can be trusted for business expansion.

Funding: This research received no external funding.

Figure A1. Actual vs. predicted sales forecast of (1) stochastic dropout models and (2) noise embedded in predicted sales and active booker count input variables. The y-axis represents the normalized sales, whereas the x-axis represents the dates of January 2020. The shaded area represents the range of stochastic variance (uncertainty) in the model predictions.