A Data-Driven Approach to Tourism Demand Forecasting: Integrating Web Search Data into a SARIMAX Model

Lee, Geun-Cheol

doi:10.3390/data10050073


College of Business, Konkuk University, Seoul 05029, Republic of Korea

Data 2025, 10(5), 73; https://doi.org/10.3390/data10050073

Submission received: 20 March 2025 / Revised: 7 May 2025 / Accepted: 9 May 2025 / Published: 10 May 2025

(This article belongs to the Section Information Systems and Data Management)


Abstract

Tourism is a core sector of Singapore’s economy, contributing significantly to Gross Domestic Product (GDP) and employment. Accurate tourism demand forecasting is essential for strategic planning, resource allocation, and economic stability, particularly in the post-COVID-19 era. This study develops a SARIMAX-based forecasting model to predict monthly visitor arrivals to Singapore, integrating web search data from Google Trends and external factors. To enhance model accuracy, a systematic selection process was applied to identify the effective subset of external variables. Results of the empirical experiments demonstrate that the proposed SARIMAX model outperforms traditional univariate models, including SARIMA, Holt–Winters, and Prophet, as well as machine learning-based approaches such as Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNNs). When forecasting the 24-month period of 2023 and 2024, the proposed model achieves the lowest Mean Absolute Percentage Error (MAPE) of 7.32%.

Keywords:

tourism demand forecasting; SARIMAX; exogenous variables; Google Trends; time-series analysis; post-COVID-19

1. Introduction

Tourism plays an important role in Singapore’s economy, serving as a major driver of economic growth and employment. As of 2020, the tourism sector contributed 11.1% to Singapore’s GDP, generating approximately SGD 52.5 billion (Singapore Dollars) or USD 40.4 billion (US Dollars) in economic value [1]. Beyond direct revenue generation, tourism substantially contributes to Singapore’s fiscal revenue through taxation and tourist spending, reinforcing its role as a fundamental pillar of the nation’s economic structure [2]. The government’s strategic investments in large-scale tourism infrastructure projects, such as Marina Bay Sands and Resorts World Sentosa, have further supported the expansion of the industry, ensuring its continued contribution to national economic stability.

Given its extensive economic impact, sustaining and enhancing Singapore’s tourism sector remains a critical priority, necessitating continuous policy support and investment. Tourism demand forecasting plays a crucial role in this effort, as it enables policymakers and industry stakeholders to make decisions regarding resource allocation, infrastructure development, and strategic planning. As Frechtling emphasizes, tourism demand is perishable—unsold tourism products result in immediate losses. This fact underlines the need for accurate demand predictions. Additionally, tourism services are consumed at the point of production, requiring service providers to anticipate fluctuations in visitor numbers. Moreover, tourism demand is highly sensitive to external factors such as economic conditions, government policies, and global crises, further highlighting the necessity of robust forecasting models [3]. Given these considerations, enhancing the accuracy and reliability of tourism demand forecasting in Singapore, where the tourism industry plays a critical role, is essential for ensuring economic stability and strengthening the country’s competitiveness as a global tourist destination.

1.1. The Literature on Tourism Demand Forecasting

Due to the significance of tourism demand forecasting in Singapore, as discussed in the previous paragraphs, a variety of studies have explored this topic over the decades. In one of the earliest studies, Chan [4] proposed a sine wave time-series regression model to forecast monthly tourist arrivals in Singapore for the period January 1989 to July 1990, during which the model achieved a Mean Absolute Percentage Error (MAPE) below 3%. Chu [5] re-examined the same dataset using a Seasonal AutoRegressive Integrated Moving Average (SARIMA) model, specifically SARIMA(3,1,0)(0,1,0)₁₂. When seasonal adjustments were properly incorporated, the SARIMA model, with a MAPE of 1.857%, outperformed existing models for the same period as studied by Chan [4]. While most early models assumed relatively stable environmental conditions, Chan et al. [6] investigated the impact of sudden environmental changes, specifically the Gulf War, on Singapore’s tourism demand forecasts. The authors compared five forecasting techniques, including ARIMA, Exponential Smoothing, and Naïve models, using data from 1984 to 1992. Contrary to expectations, the Naïve II model performed best in unstable conditions, with a MAPE of 3.83%. A more recent study by Chu [7] explored ARMA-based methods for forecasting tourism demand in various Asia-Pacific destinations, including Singapore. Among ARMA-based models, ARFIMA (Autoregressive Fractionally Integrated Moving Average) exhibited superior performance in forecasting monthly demand during 2007–2008. Kumar and Sharma [8] employed a SARIMA model to generate monthly forecasts of tourist arrivals in Singapore. When forecasting 2014–2016 demand, SARIMA(1,0,1)(1,1,0)₁₂ was identified as the best-fit model. In contrast to time series-only approaches, Agiomirgianakis et al. [9] focused on the impact of macroeconomic and policy variables such as Gross Domestic Product (GDP), real exchange rates, exchange rate volatility, and temperature on tourist arrivals to Singapore. Employing a panel cointegration model using quarterly data from 2005 to 2014, they found that while a weaker exchange rate boosts tourism, higher volatility in exchange rates significantly decreases inbound tourism.

While these studies proposed various methodological approaches, existing research primarily focused on tourism demand forecasting before the COVID-19 pandemic. Thus, their models assume relatively stable market conditions and often fail to capture the extreme fluctuations and uncertainties of demand during the pandemic. A few recent studies have attempted to develop more resilient forecasting models. Danbatta and Varol [10] proposed a hybrid forecasting model combining an Artificial Neural Network (ANN) with Polynomial-Fourier time series decomposition. Using monthly data from 2004 to 2020, their model produced forecasts of monthly visitor arrivals in Singapore for 2021. However, the model exhibited significant deviations from actual values, illustrating the difficulty of making accurate forecasts during the pandemic period. Qiu et al. [11] proposed a two-stage scenario-based forecasting framework, which separately analyzed tourism demand before and after the pandemic. The first stage used traditional statistical models to forecast demand under pre-pandemic conditions, and the second stage generated quarterly forecasts from 2020 to 2021 based on three hypothetical recovery scenarios (V, U, and L-shaped), using ensemble models with stacking. Among 26 configurations, their stacked models outperformed individual methods. Zhang et al. [12] adopted a mixed-method forecasting approach combining a quantitative model with Delphi-based expert judgment adjustments to predict tourism recovery in many countries including Singapore. Using macroeconomic indicators and quarterly data from 2000 to 2019, they first generated a baseline forecast and then adjusted it using expert consensus on possible recovery paths. Their scenario-based forecast showed that short-haul markets including Singapore would recover more quickly than long-haul markets.

1.2. Research Gaps and Objectives

Despite various existing studies as presented in the preceding subsection, several critical research gaps remain unaddressed. First, forecasting models developed prior to the COVID-19 pandemic are inherently limited in their ability to capture the demand volatility and structural disruptions observed during and after the pandemic. While a few recent studies have attempted to address these challenges, they either rely on qualitative adjustments or yield relatively large forecast errors, indicating the difficulty of producing accurate forecasts under such uncertain conditions. Therefore, there is a need for a data-driven forecasting approach that can effectively capture the volatility in tourism demand caused by COVID-19. Such a method should be capable of explaining demand patterns during the pandemic period and obtaining relatively accurate forecasts in the post-pandemic period. Given these challenges, the primary objective of this study is to develop a monthly forecasting model capable of accurately predicting Singapore’s post-COVID-19 tourism demand. To better capture the external influences that impact tourism demand, we incorporate real-time web search data, such as Google Trends queries. By extracting and integrating various tourism-related search queries, our model aims to enhance predictive accuracy and responsiveness to market changes.

The remainder of this paper is organized as follows. The next section presents an analysis of the relevant data, focusing on the characteristics of the time series of Singapore’s visitor arrivals. In particular, we examine the underlying patterns and identify key external factors related to tourism demand, with an emphasis on web search data. The subsequent section introduces the SARIMA and SARIMAX (SARIMA with eXogenous variables) time-series models. In Section 4, the procedure of model identification for SARIMA is presented and we also outline the procedure for selecting the most appropriate combination of exogenous variables from the identified influencing factors. Next, a comparative experiment is conducted to evaluate the performance of the proposed SARIMAX model. In this study, we forecast monthly visitor arrivals to Singapore for 2023 and 2024 using various time-series models and machine learning approaches, comparing the results to assess forecasting accuracy. Finally, the concluding section summarizes the key findings of this study and discusses potential directions for future research.

2. Data Analysis

In this section, we first analyze the characteristics of the Singapore visitor arrivals time series to understand its basic patterns. Through graphical visualization, we examine the overall trend, seasonality, and potential impacts of external factors on visitor arrivals. For this analysis, we utilize ten years of monthly international visitor arrivals data from 2013 to 2022. The dataset is collected from the Department of Statistics Singapore (singstat.gov.sg, accessed on 25 February 2025).

Figure 1 clearly shows a structural shift in the time series before and after the onset of the COVID-19 pandemic. From 2013 to 2019, the data exhibits a gradual upward trend with clear seasonal fluctuations, a typical pattern commonly observed in tourism demand. However, in 2020 and 2021, the number of visitors dropped sharply, approaching near zero, reflecting the severe impact of the pandemic on international travel. From 2022 onwards, the visitor arrivals show a gradual recovery, but the pattern observed during this period does not fully resemble the pre-pandemic time series. Instead, it represents a unique transitional phase, distinct from both the stable seasonal fluctuations before 2020 and the flat, near-zero arrivals during the pandemic. This suggests that the post-pandemic recovery follows a different trajectory, necessitating a revised approach to forecasting methods.

To forecast the time series beyond 2022, this study aims to identify external factors influencing tourism demand using web search data. A recent survey on tourism demand forecasting [13] indicates a growing number of research efforts utilizing web search data from platforms such as Google and Baidu to enhance predictive accuracy. Recognizing the potential of online search behavior as a leading indicator of tourism activity, this study leverages Google Trends to extract relevant search queries. The search queries were selected through a combination of intuition and visual similarity between the visitor arrivals time series and the extracted search trends. The selected search queries are as follows:

Selected queries: “Singapore”, “Singapore Hotel”, “Changi”, “Singapore Flight”, “Singapore Dollar”, “Singapore Airport”, “Marina Bay Sands Singapore”, “Best Singapore”, “Singapore Weather”, “Singapore Visa”.

The selected queries were identified through a two-step process. First, we intuitively considered keywords that potential visitors to Singapore are likely to search for when planning their trips. These include terms related to travel logistics (e.g., “Singapore Flight”, “Singapore Visa”), accommodations (e.g., “Singapore Hotel”), major attractions (e.g., “Marina Bay Sands Singapore”), and general travel-related information (e.g., “Singapore Weather”, “Best Singapore”). Second, we retrieved the Google Trends search volume data for each query from 2013 to 2022 and visually examined their time series patterns. By comparing these patterns with the visitor arrivals data, we selected 10 queries that demonstrated the highest graphical similarity with tourism demand fluctuations. Figure 2 presents the time series plots of the relative search volumes for the selected queries, illustrating their temporal variations over the study period.

Figure 2 displays the relative search volumes of the ten selected Google Trends queries from 2013 to 2022, scaled between 0 and 100. It is important to note that Google Trends does not provide exact search volume data, but rather a normalized index representing the popularity of each query over time. Consequently, the y-axis represents relative search interest rather than actual search counts. For data collection, the region was set to “Worldwide” and the category to “Travel”. While the search query patterns do not perfectly align with actual visitor arrivals, they all exhibit a sharp decline during the pandemic period (2020–2021) followed by a recovery trend from 2022 onwards. This consistent behavior across different queries suggests that Google search interest reflects global travel trends and can serve as a valuable predictor for visitor demand forecasting.
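To make the notion of a normalized index concrete, the following minimal sketch rescales a series so that its peak equals 100. It is an illustration only: the function name `to_relative_index` and the raw counts are hypothetical, and the actual Google Trends computation (which works on sampled search shares) is not reproduced here.

```python
def to_relative_index(counts):
    """Rescale a series so that its maximum maps to 100 (rounded to integers)."""
    peak = max(counts)
    if peak <= 0:
        raise ValueError("series must contain at least one positive value")
    return [round(100 * c / peak) for c in counts]


# Hypothetical raw monthly counts; real search counts are not published.
raw_counts = [1200, 3000, 150, 2400]
print(to_relative_index(raw_counts))  # [40, 100, 5, 80]
```

Because every series is rescaled by its own peak, two queries with the same index value can correspond to very different absolute search volumes, which is why the y-axis in Figure 2 should be read as relative interest only.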

To assess the statistical significance of the relationship between each search query and tourism demand, we conducted a Pearson correlation test. In addition to the ten selected Google Trends queries, this study incorporates an additional exogenous variable: the number of passengers at Singapore Changi Airport. The inclusion of this variable aims to capture the direct impact of airport activity on visitor arrivals, providing further insights into the dynamics of tourism demand. The Changi Airport passenger data were collected from the Department of Statistics Singapore. Table 1 presents the Pearson correlation coefficients between the number of visitor arrivals and the 10 selected Google Trends search queries, as well as the number of Changi Airport passengers, showing the strength and direction of their relationships.

As shown in Table 1, the selected Google Trends search queries and the number of Changi Airport passengers exhibit a strong positive correlation with visitor arrivals. The correlation coefficients range from 0.7037 to 0.9851, indicating varying degrees of association between the explanatory variables and tourism demand. Despite this variation, all p-values are small enough to show statistical significance. Based on these correlation results, this study incorporates all 11 variables (10 Google Trends queries and Changi Airport passenger numbers) as exogenous variables in the visitor arrivals forecasting model. These variables are expected to enhance the predictive accuracy of the model, and the detailed methodology for incorporating these variables into the forecasting model is described in the following section.
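As a rough illustration of the correlation analysis described above, the sketch below computes a Pearson coefficient and the associated t-statistic in plain Python. The function names and the toy numbers are hypothetical and are not taken from the study's data; a p-value would additionally require the t-distribution with n − 2 degrees of freedom, which is omitted here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    denom = 1.0 - r * r
    if denom <= 0.0:  # perfect correlation (or rounding slightly past 1)
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / denom)


# Toy series standing in for visitor arrivals and one search index.
arrivals = [1, 2, 3, 4, 5]
search_index = [1, 3, 2, 5, 4]
r = pearson_r(arrivals, search_index)
print(round(r, 4), round(t_statistic(r, len(arrivals)), 4))
```

With 120 monthly observations, even moderate coefficients produce large t-statistics, which is consistent with all p-values in Table 1 being small.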

3. SARIMA and SARIMAX Models

Based on the characteristics of the observed data, we select SARIMA as a baseline model and then extend it to SARIMAX by including the external factors identified in Section 2. As observed in Figure 1, the pre-pandemic visitor arrivals time series exhibits a typical pattern of monthly tourism demand, characterized by seasonal fluctuations and moderate upward trends. Due to these characteristics, most previous studies on Singapore’s tourism demand forecasting have employed ARIMA-based models, which are well-suited for capturing stationary and seasonal patterns in time series data.

SARIMA models extend the standard ARIMA model by incorporating seasonal autoregressive and moving average terms. The general form of the SARIMA model for a time series $\{y_t\}$ is expressed as

$$\Phi_P(B^s)\,\phi_p(B)\,(1 - B)^d\,(1 - B^s)^D\,y_t = \Theta_Q(B^s)\,\theta_q(B)\,\varepsilon_t, \qquad (1)$$

where $B$ denotes the backshift operator (i.e., $B^k y_t = y_{t-k}$), $d$ and $D$ represent the orders of non-seasonal and seasonal differencing, respectively, $\phi_p(B)$ and $\theta_q(B)$ are the non-seasonal AR and MA polynomials, $\Phi_P(B^s)$ and $\Theta_Q(B^s)$ are the seasonal AR and MA polynomials with seasonal period $s$, and $\varepsilon_t$ is a white noise error term. More details about the SARIMA model can be found in Box et al. [14].

To enhance the model’s predictive capability by incorporating relevant external information, we adopt the SARIMAX model. The SARIMAX model augments the SARIMA framework by adding a regression component for external predictors. The SARIMAX model is expressed as

$$\Phi_P(B^s)\,\phi_p(B)\,(1 - B)^d\,(1 - B^s)^D\,y_t = \Theta_Q(B^s)\,\theta_q(B)\,\varepsilon_t + \beta' X_t, \qquad (2)$$

where $X_t$ is a vector of exogenous variables at time $t$, and $\beta$ is a vector of corresponding coefficients.

In this study, the external factors discussed in Section 2 can be included in the exogenous variables $X_t$ of the SARIMAX model. The following section presents the procedure for model identification of SARIMA as well as the methodology used for selecting the best set of exogenous variables.

4. Methodology

Having introduced the SARIMA and SARIMAX models in the previous section, we now turn to the procedures for model identification and parameter estimation of the SARIMA model. This section begins by identifying the best-fitting SARIMA model for the observed time series of monthly visitor arrivals to Singapore during 2013–2022. Then, we proceed to the selection of exogenous variables to be incorporated into the SARIMAX model. These two steps are explained in detail in the following two subsections, respectively.

4.1. SARIMA Fitting

To apply the SARIMA model fitting process, it is first necessary to identify the appropriate model order, i.e., determining the values of p, d, q, P, D, and Q. As an initial step, we assess the degree of differencing required to achieve stationarity by conducting the Augmented Dickey–Fuller (ADF) test on the original time series, first-order differenced series, and double-differenced series. Here, double-differencing refers to applying a first-order difference followed by a seasonal differencing operation at the lag of 12 months. For more details about stationarity tests in time series models, see Brockwell and Davis [15].
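The double-differencing operation described above corresponds to applying (1 − B)(1 − B¹²), i.e., y_t − y_{t−1} − y_{t−12} + y_{t−13}. A minimal sketch in plain Python follows; the helper names and the toy series are hypothetical. The resulting series could then be passed to an ADF routine, for instance `adfuller` from statsmodels, which is not shown here.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must satisfy 1 <= lag < len(series)")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def double_difference(series, season=12):
    """First-order difference followed by a seasonal difference at lag `season`."""
    return difference(difference(series, lag=1), lag=season)


# Toy monthly series: linear trend plus a pattern repeating every 12 months.
toy = [2 * t + (t % 12) for t in range(36)]
dd = double_difference(toy)
print(len(dd), set(dd))  # 23 {0}
```

Each differencing step shortens the series (by 1 and then by 12 observations), so a 120-month training series leaves 107 double-differenced values.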

Table 2 summarizes the results of the ADF test conducted on the original, first-differenced, and double-differenced time series of visitor arrivals. The null hypothesis (H₀) of the ADF test states that a unit root is present, meaning that the series is non-stationary. At a significance level of 0.01, the null hypothesis cannot be rejected for either the original series or the first-differenced series, suggesting that these series may still be non-stationary. For the double-differenced series, the p-value is extremely small, so the null hypothesis is rejected and the series can be treated as stationary.

We also inspect the transformed time series visually. In Figure 3, the upper panel shows the first-differenced series, which fluctuates considerably throughout the period, while the lower panel shows the double-differenced series, in which the fluctuation is reduced. Although large swings remain during the pandemic period, the overall series appears stationary, consistent with the results of the ADF test above. Based on these results, we set d = 1 and D = 1 in the SARIMA model to achieve stationarity in the time series.

Next, we determine the autoregressive (AR) order (p), moving average (MA) order (q), seasonal AR order (P), and seasonal MA order (Q) for the SARIMA model. Unlike the differencing parameters (d, D), the selection of p, q, P, and Q is not a strictly rule-based process but rather involves an empirical judgment, as introduced in Box et al. [14]. In this study, we first examined the ACF and PACF plots of the double-differenced series of the Singapore visitor arrivals (Figure 4). As shown in the figure, both the ACF and PACF exhibit a prominent first spike, and a seasonal pattern with a periodicity of 12 is observed. However, rather than strictly adhering to the lag values inferred from the plots, we explored a range of potential values for each order, considering one step lower and higher than the observed values to ensure a more comprehensive search for the optimal model parameters. Thus, we assumed that each order (p, P, q, Q) can take values from the set {0, 1, 2}, resulting in a total of 81 possible SARIMA model configurations (3 × 3 × 3 × 3 = 81). We then conduct a grid search, fitting all 81 SARIMA models and evaluating them based on model selection criteria such as the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The selected results are summarized in the following table.

Table 3 presents the six SARIMA model configurations selected from a total of 81 fitted models, based on the lowest AIC values. Among them, the SARIMA(1, 1, 2)(1, 1, 2)₁₂ model exhibits the lowest AIC (2128.55), which also corresponds to the second-lowest BIC (2145.22). Although the SARIMA(0, 1, 2)(1, 1, 2)₁₂ model yields the lowest BIC (2145.04), its AIC (2133.13) is considerably higher than that of the SARIMA(1, 1, 2)(1, 1, 2)₁₂ model. Therefore, we ultimately select SARIMA(1, 1, 2)(1, 1, 2)₁₂ as the baseline model. A notable characteristic of the selected model is that the number of MA orders exceeds the number of AR orders, suggesting that the time series exhibits substantial fluctuations and noise components. This SARIMA(1, 1, 2)(1, 1, 2)₁₂ model will also be used as the baseline configuration in the subsequent SARIMAX model, where exogenous variables will be incorporated.
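The grid search over the 81 order combinations can be sketched as follows. This is an illustrative sketch rather than the code used in the study: `grid_search`, `sarimax_aic`, and `toy_score` are hypothetical names, and the scoring function is passed in as a callable so that the search logic can be run without fitting any model. `sarimax_aic` assumes statsmodels is installed and uses its `SARIMAX` class.

```python
from itertools import product


def sarimax_aic(y, order, seasonal_order, exog=None):
    """Fit one model with statsmodels and return its AIC (needs statsmodels)."""
    from statsmodels.tsa.statespace.sarimax import SARIMAX  # imported lazily

    result = SARIMAX(y, exog=exog, order=order,
                     seasonal_order=seasonal_order).fit(disp=False)
    return result.aic


def grid_search(score, values=(0, 1, 2), d=1, D=1, s=12, top=6):
    """Score every (p, d, q)(P, D, Q)_s combination; return the `top` lowest."""
    results = []
    for p, q, P, Q in product(values, repeat=4):
        order, seasonal = (p, d, q), (P, D, Q, s)
        try:
            results.append((score(order, seasonal), order, seasonal))
        except Exception:
            continue  # skip configurations for which the fit fails
    results.sort(key=lambda item: item[0])
    return results[:top]


def toy_score(order, seasonal):
    """Stand-in for an AIC so the search can be demonstrated without data."""
    return (abs(order[0] - 1) + abs(order[2] - 2)
            + abs(seasonal[0] - 1) + abs(seasonal[2] - 2))


for aic, order, seasonal in grid_search(toy_score):
    print(aic, order, seasonal)
```

With real data, one would pass something like `lambda o, so: sarimax_aic(y, o, so)` as `score`, where `y` is the training series. Ranking by BIC instead only requires returning `result.bic` from the scoring function.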

4.2. Exogenous Variable Selection for SARIMAX

Next, we introduce the procedure for selecting exogenous variables in the SARIMAX model. In Section 2, multiple factors were identified as potential candidates for exogenous variables. However, this study does not assume that using only one exogenous variable or including all of them will necessarily yield the best results. Instead, we aim to identify the optimal subset of exogenous variables that maximizes forecasting accuracy. Given that exploring all possible subsets would result in an excessive number of combinations, making exhaustive search computationally impractical, we employ a systematic selection approach to identify the most effective subset. The following procedure is designed to efficiently determine the best-performing exogenous variable combination while balancing model complexity and predictive performance.

Backward and Forward Selection Procedure

initial_subset = all exogenous variables
best_subset = initial_subset
best_aic = infinity
current_subset = initial_subset

# Backward: remove variables until no AIC improvement
WHILE size(current_subset) > 1 DO:
    FOR EACH variable v in current_subset: fit SARIMAX without v, compute AIC
    remove from current_subset the variable whose removal gives the lowest AIC
    IF that AIC < best_aic THEN update best_aic, best_subset = current_subset
    ELSE BREAK
END WHILE

# Forward: add variables until no AIC improvement
remaining = initial_subset − best_subset
WHILE remaining NOT empty DO:
    FOR EACH variable v in remaining: fit SARIMAX with best_subset + v, compute AIC
    IF the lowest AIC < best_aic THEN add that variable to best_subset, remove it from remaining, update best_aic
    ELSE BREAK
END WHILE

RETURN best_subset
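One possible Python rendering of the procedure above is sketched below. It is not the author's implementation: the function name `backward_forward_select` and the toy scoring function are hypothetical, and, as an assumption that departs from the pseudocode, the search starts from the score of the full variable set rather than from infinity, so that the full set itself remains a candidate. The `score` callable is expected to return an AIC-like value (lower is better) for a list of variable names.

```python
def backward_forward_select(variables, score):
    """Greedy backward elimination followed by forward addition."""
    best = list(variables)
    best_score = score(best)  # assumption: start from the full-set score

    # Backward: drop the variable whose removal lowers the score the most.
    while len(best) > 1:
        trials = [(score([v for v in best if v != drop]), drop) for drop in best]
        trial_score, drop = min(trials, key=lambda t: t[0])
        if trial_score >= best_score:
            break
        best = [v for v in best if v != drop]
        best_score = trial_score

    # Forward: re-add a removed variable if doing so lowers the score.
    remaining = [v for v in variables if v not in best]
    while remaining:
        trials = [(score(best + [v]), v) for v in remaining]
        trial_score, add = min(trials, key=lambda t: t[0])
        if trial_score >= best_score:
            break
        best.append(add)
        remaining.remove(add)
        best_score = trial_score

    return best, best_score


# Toy score: each useful variable lowers the value, each variable costs 1.
USEFUL = {"a", "b"}


def toy_score(subset):
    return 10 - 3 * len(USEFUL & set(subset)) + len(subset)


print(backward_forward_select(["a", "b", "c", "d"], toy_score))
```

In practice, `score` would fit a SARIMAX model using the columns named in the subset and return its AIC. Each backward pass fits at most one model per remaining variable, so the total number of fits grows roughly quadratically with the number of candidates instead of exponentially.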

Through the above procedure, we efficiently identified a combination of exogenous variables that improved the fitting results of the SARIMAX model. The selected set of exogenous variables includes [‘Number of Passengers’, ‘Singapore Hotel’, ‘Singapore Flight’, ‘Singapore Dollar’, ‘Best Singapore’, ‘Singapore’]. The SARIMAX model fitted with only these selected exogenous variables outperformed the model that included all considered external factors. Specifically, the selected model demonstrated improved performance metrics in terms of AIC and BIC values, indicating enhanced fitness and reduced model complexity. Table 4 summarizes the performance comparison between the SARIMAX model with all exogenous variables and the SARIMAX model with the subset of the selected exogenous variables.

In summary, this study analyzed data to identify appropriate external factors, focusing on web search keywords, to forecast the number of visitors to Singapore. After fitting the SARIMA model, we employed the Backward and Forward Selection procedure to determine the optimal combination of exogenous variables that best explained the time series of Singapore’s visitor counts during the training period. The results confirmed that the selected subset of exogenous variables improved model fit while maintaining lower model complexity. To further validate the predictive performance of the proposed model, the next section presents a comparative analysis against various existing forecasting methods. Specifically, we forecast the number of visitors for a 24-month period in 2023–2024 and evaluate the results to assess the model’s effectiveness.

5. Computational Experiments

In this section, we conduct comparative experiments to validate the performance of the proposed time-series forecasting model. As described in Section 2, the monthly visitor arrival data for Singapore from 2013 to 2022 serve as the training dataset, and the 24 months of 2023 and 2024 are used as the test dataset for model evaluation. To assess the predictive accuracy of the proposed model, we employ three widely used performance metrics: Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
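For reference, the three metrics can be computed as in the minimal sketch below, using their standard definitions. The function names and the example numbers are hypothetical and unrelated to the results reported in this study.

```python
import math


def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mae(actual, forecast):
    """Mean Absolute Error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root Mean Squared Error, in the units of the series."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


actual = [100, 200, 400]
forecast = [110, 180, 400]
print(round(mape(actual, forecast), 2), mae(actual, forecast),
      round(rmse(actual, forecast), 2))  # 6.67 10.0 12.91
```

MAPE is scale-free, whereas MAE and RMSE are expressed in visitor counts. RMSE penalizes large individual errors more heavily, so the three metrics can rank models differently.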

To evaluate the performance of the proposed forecasting model, we compare it against several well-established benchmark methods. These benchmarks include widely used univariate forecasting models, as well as machine learning-based approaches. Specifically, the following methods are considered:

Holt’s Method: a classical exponential smoothing technique that extends simple exponential smoothing by incorporating a trend component, allowing for improved forecasts in time series with linear trends [16].

Winters’ Method (Holt–Winters): an extension of Holt’s Method that additionally accounts for seasonality, making it particularly effective for time series with periodic fluctuations [17].

Prophet: a robust forecasting model developed by Meta (formerly Facebook), designed to handle missing data, outliers, and holiday effects. It employs an additive regression framework with seasonality components, making it suitable for business and economic time-series forecasting [18].

Long Short-Term Memory (LSTM): a type of recurrent neural network (RNN) that can capture long-range dependencies in time-series data. LSTM models have been widely used in tourism forecasting due to their ability to learn complex temporal patterns [19].

Recurrent Neural Networks (RNNs): a deep learning approach for sequential data that processes information recursively, making it suitable for capturing temporal dependencies in time-series forecasting [20].

These benchmark models provide a diverse set of approaches, ranging from traditional statistical methods to modern deep learning techniques.

To ensure the robustness of the machine learning-based approaches, we account for the inherent randomness in LSTM and RNN models by conducting five independent trials and using the average of the predicted values as the final forecast. Both of these deep learning methods are highly sensitive to hyperparameter configurations, which can significantly impact their forecasting performance. In this study, we adopted the following hyperparameter settings for LSTM and RNN models:

LSTM: Number of LSTM units: 50 (with two layers); Dropout rate: 0.2; Number of epochs: 50; Optimizer: Adam; Batch size: 32.

RNN: Number of Simple RNN units: 50; Number of RNN layers: 2; Optimizer: Adam; Number of epochs: 100; Batch size: 32.
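The averaging of independent trials mentioned above amounts to an element-wise mean over the forecasts of each run. A minimal sketch follows, with a hypothetical function name and made-up numbers; training the networks themselves is not shown.

```python
def average_forecasts(runs):
    """Element-wise mean of several equally long forecast sequences."""
    if not runs:
        raise ValueError("at least one run is required")
    horizon = len(runs[0])
    if any(len(run) != horizon for run in runs):
        raise ValueError("all runs must cover the same forecast horizon")
    return [sum(step) / len(runs) for step in zip(*runs)]


# Three hypothetical runs of a two-month forecast.
runs = [[100.0, 120.0], [110.0, 130.0], [90.0, 110.0]]
print(average_forecasts(runs))  # [100.0, 120.0]
```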

Table 5 presents the performance evaluation of the tested forecasting models using the three metrics. Among the tested methods, the proposed model, i.e., the SARIMAX model with the selected exogenous variables, achieved the lowest MAPE, at 7.32%. This indicates that the systematic selection of exogenous variables improved forecasting accuracy compared to the alternative SARIMA-based approaches. The SARIMAX model incorporating all exogenous variables also performed well and achieved the best MAE. In contrast, the SARIMA model without any exogenous variables performed considerably worse, with a double-digit MAPE, indicating that the inclusion of carefully selected external factors is crucial for improving predictive accuracy.

Among the traditional time-series forecasting models, both Holt’s Method and Winters’ Method showed relatively weak performance. The other univariate methods, including Prophet, LSTM, and RNN, also do not outperform the proposed model. Prophet performed the worst among the tested models. It appears to have overemphasized the recent pandemic-related drop in demand, resulting in poor forecasts. The performance of LSTM and RNN did not meet expectations in this test, likely due to the relatively small dataset size and the challenges associated with optimizing deep learning models for time-series forecasting. Overall, the results indicate that incorporating an appropriate combination of exogenous variables significantly improves forecasting accuracy. For a more detailed analysis of the forecasting results, the actual and forecasted values are plotted in the following figure.

Figure 5 presents a comparison between the actual number of visitor arrivals (blue solid line) and the forecasted values (red dashed line) obtained using the proposed SARIMAX model for the years 2023 and 2024. The plot shows that the SARIMAX model captures the general trend and seasonal variations in visitor arrivals well. While the model follows the overall pattern of fluctuations throughout both years, slight underestimations and overestimations are visible in some months, and the forecast errors tend to increase in 2024, which is further away from the training data period. Despite these deviations, the proposed SARIMAX model provides a reliable forecast, as indicated by the relatively low error metrics in the above table.

6. Conclusions

This study proposed and evaluated a SARIMAX-based forecasting model to predict monthly visitor arrivals to Singapore in the post-COVID-19 period, specifically for the years 2023 and 2024. By integrating real-time web search data from Google Trends and Changi Airport passenger numbers as exogenous variables, the model aimed to enhance predictive accuracy and capture the complex dynamics of tourism demand following the unprecedented disruptions caused by the global pandemic. The empirical results demonstrate that the proposed SARIMAX model, with a carefully selected subset of exogenous variables, showed good performance compared to traditional univariate time-series models (e.g., SARIMA, Holt’s Method, Winters’ Method) as well as machine learning-based approaches (e.g., Prophet, LSTM, RNN) in terms of MAPE, MAE, and RMSE. The superior performance of the proposed model, achieving a MAPE of 7.32%, underscores the value of incorporating external factors—such as online search behavior and airport activity—into tourism demand forecasting frameworks.

While this study demonstrates the effectiveness of the proposed SARIMAX model in forecasting Singapore’s tourism demand, several areas remain for future research to enhance the model’s practical applicability and accuracy. One key challenge in real-world forecasting applications is that the exogenous variables used in the model must be available at the time of prediction. In practice, certain external factors may only become available with a time lag, limiting their immediate usability for forecasting. To address this issue, future studies should explore lag analysis to identify exogenous variables that are available in advance and still provide strong predictive power. By incorporating lagged versions of the selected exogenous variables, the model can be adapted for real-time forecasting applications.

Additionally, while this study primarily relied on Google Trends search queries and Changi Airport passenger data as exogenous variables, future research could explore alternative data sources, such as social media sentiment analysis, mobility data from mobile devices, airline booking trends, or recent generative AI query data. The integration of such data sources may further improve the robustness and responsiveness of the forecasting model to sudden market changes.

Another promising avenue for future work involves hybrid modeling approaches, where machine learning techniques such as LSTM or transformer-based models could be integrated with traditional SARIMAX models. By utilizing the strengths of both statistical and deep learning methods, a hybrid approach could capture both linear and nonlinear patterns in tourism demand more effectively. Furthermore, future research could consider regime-switching time series models, such as Markov Switching models, to explicitly capture structural breaks in tourism demand surrounding the COVID-19 pandemic. These models have been applied in other domains, such as financial market analyses around the pandemic period [21]. Their application in tourism forecasting could improve the understanding of demand dynamics under volatile conditions.
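The lag analysis suggested above could start from something like the following sketch, which correlates the target series with progressively older values of one exogenous series. The helper names `pearson` and `lagged_correlations`, the toy series, and the choice of plain Pearson correlation are all assumptions for illustration; the study itself does not specify a lag-analysis procedure.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def lagged_correlations(target, exog, max_lag):
    """Correlate target[t] with exog[t - lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        # Drop the first `lag` target values and the last `lag` exog values
        # so that target[t] is paired with exog[t - lag].
        y = target[lag:]
        x = exog[:len(exog) - lag]
        result[lag] = pearson(y, x)
    return result

# Hypothetical toy series: the search index leads arrivals by two months.
search_index = [10, 14, 9, 20, 25, 18, 30, 27, 35, 40, 33, 45]
arrivals = [0, 0] + [100 * v for v in search_index[:-2]]

corr = lagged_correlations(arrivals, search_index, max_lag=4)
best_lag = max(corr, key=corr.get)  # lag with the strongest correlation
```

A variable whose correlation remains strong at a lag of one or more months is a candidate for a forecasting setup in which only already-published values are used as exogenous inputs.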

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The monthly visitor arrivals to Singapore and the monthly number of passengers at Changi Airport were obtained from the Department of Statistics Singapore (singstat.gov.sg).

Acknowledgments

This paper was written as part of Konkuk University’s research support program for its faculty on sabbatical leave in 2025.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Ngoc, B.H.; Hoang, C.C.; Tram, N.H.M. A Time-Varying Analysis between Economic Uncertainty and Tourism Development in Singapore. PLoS ONE 2024, 19, e0302980.
2. Wong, D.W.H.; Tai, A.C.L.; Chan, D.Y.T.; Lee, H.F. Can Tourism Development and Economic Growth Mutually Reinforce in Small Countries? Evidence from Singapore. Curr. Issues Tour. 2024, 27, 1316–1331.
3. Frechtling, D. Forecasting Tourism Demand; Routledge: London, UK, 2012.
4. Chan, Y.-M. Forecasting Tourism: A Sine Wave Time Series Regression Approach. J. Travel Res. 1993, 32, 58–60.
5. Chu, F.-L. Forecasting Tourist Arrivals: Nonlinear Sine Wave or ARIMA? J. Travel Res. 1998, 36, 79–84.
6. Chan, Y.-M.; Hui, T.-K.; Yuen, E. Modeling the Impact of Sudden Environmental Changes on Visitor Arrival Forecasts: The Case of the Gulf War. J. Travel Res. 1999, 37, 391–394.
7. Chu, F.-L. Forecasting Tourism Demand with ARMA-Based Methods. Tour. Manag. 2009, 30, 740–751.
8. Kumar, M.; Sharma, S. Forecasting Tourist In-Flow in South East Asia: A Case of Singapore. Tour. Manag. Stud. 2016, 12, 107–119.
9. Agiomirgianakis, G.; Serenis, D.; Tsounis, N. Effective Timing of Tourism Policy: The Case of Singapore. Econ. Model. 2017, 60, 29–38.
10. Danbatta, S.J.; Varol, A. Forecasting Foreign Visitors Arrivals Using Hybrid Model and Monte Carlo Simulation. Int. J. Inf. Technol. Decis. Mak. 2022, 21, 1859–1878.
11. Qiu, R.T.R.; Wu, D.C.; Dropsy, V.; Petit, S.; Pratt, S.; Ohe, Y. Visitor Arrivals Forecasts amid COVID-19: A Perspective from the Asia and Pacific Team. Ann. Tour. Res. 2021, 88, 103155.
12. Zhang, H.; Song, H.; Wen, L.; Liu, C. Forecasting Tourism Recovery amid COVID-19. Ann. Tour. Res. 2021, 87, 103149.
13. Khaidi, S.M.; Abu, N.; Muhammad, N. Tourism Demand Forecasting—A Review on the Variables and Models. J. Phys. Conf. Ser. 2019, 1366, 012111.
14. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis: Forecasting and Control, 4th ed.; John Wiley and Sons: Hoboken, NJ, USA, 2008.
15. Brockwell, P.J.; Davis, R.A. Introduction to Time Series and Forecasting; Springer Texts in Statistics; Springer International Publishing: Cham, Switzerland, 2016.
16. Holt, C.C. Forecasting Seasonals and Trends by Exponentially Weighted Averages. Carnegie Inst. Technol. 1957, 52, 1–52.
17. Winters, P.R. Forecasting Sales by Exponentially Weighted Moving Averages. Manag. Sci. 1960, 6, 324–342.
18. Taylor, S.J.; Letham, B. Forecasting at Scale. Am. Stat. 2018, 72, 37–45.
19. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
20. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. Nature 1986, 323, 533–536.
21. Bouteska, A.; Sharif, T.; Abedin, M.Z. COVID-19 and Stock Returns: Evidence from the Markov Switching Dependence Approach. Res. Int. Bus. Financ. 2023, 64, 101882.

Figure 1. Time series plot of monthly Singapore’s international visitor arrivals (2013–2022).

Figure 2. Time series plot of relative Google Trends search volumes for selected queries (2013–2022).

Figure 3. First- and double-differenced time series of monthly visitor arrivals to Singapore.

Figure 4. ACF and PACF of the double differenced time series of monthly visitor arrivals to Singapore.

Figure 5. Actual vs. forecasted monthly visitor arrivals using the proposed SARIMAX model for 2023–2024.

Table 1. Correlation analysis of visitor arrivals with the selected Google Trends search queries and Changi Airport passenger data.

Variables	Correlation Coefficient	p-Value
Number of Changi Airport Passengers	0.9851	<0.0001
Singapore Weather	0.9400	<0.0001
Changi	0.9377	<0.0001
Best Singapore	0.9314	<0.0001
Singapore Visa	0.9023	<0.0001
Singapore Airport	0.9003	<0.0001
Singapore	0.8680	<0.0001
Marina Bay Sands Singapore	0.8219	<0.0001
Singapore Hotel	0.8205	<0.0001
Singapore Flight	0.7195	<0.0001
Singapore Dollar	0.7037	<0.0001

Table 2. Summary of ADF test results.

Series	ADF Statistic	p-Value
Original Series	−1.52	0.524
First-Differenced Series	−2.72	0.071
Double-Differenced Series	−6.16	<0.001
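The differencing steps behind Table 2 can be sketched as follows. This sketch assumes that "double-differenced" means one regular difference followed by one seasonal difference at lag 12, which would be consistent with d = 1 and D = 1 in Table 3 but is not spelled out here; the helper `difference` and the toy series are hypothetical.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the result is `lag` items shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical series: a linear trend plus a repeating 12-month pattern.
pattern = [5, 3, 0, -2, -4, -1, 2, 6, 4, 1, -3, -5]
series = [10 * t + pattern[t % 12] for t in range(36)]

# Regular differencing (d = 1) removes the linear trend.
first_diff = difference(series, lag=1)
# Seasonal differencing (D = 1, s = 12) then removes the repeating pattern.
double_diff = difference(first_diff, lag=12)
```

On this noise-free toy series the double-differenced values are all zero. On real data, a unit-root test such as `adfuller` from `statsmodels.tsa.stattools` would typically be applied to each differenced series to obtain statistics like those in Table 2.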

Table 3. AIC and BIC values for the selected SARIMA models.

(p, d, q)(P, D, Q)	AIC	BIC
(1, 1, 2)(1, 1, 2)	2128.55	2145.22
(2, 1, 2)(1, 1, 2)	2129.15	2148.20
(1, 1, 2)(2, 1, 2)	2129.67	2148.73
(0, 1, 2)(0, 1, 2)	2130.83	2152.27
(0, 1, 2)(1, 1, 2)	2133.13	2145.04
(1, 1, 2)(0, 1, 2)	2133.58	2150.26
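For reference, the two information criteria in Table 3 follow the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below implements these definitions and picks the candidate with the lowest AIC from the values in Table 3; treating the lowest AIC as the selection rule is an assumption here, and the function names are hypothetical.

```python
from math import log

def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * log(n) - 2 * log_likelihood

# AIC values copied from Table 3, keyed by (p, d, q)(P, D, Q).
table3_aic = {
    "(1, 1, 2)(1, 1, 2)": 2128.55,
    "(2, 1, 2)(1, 1, 2)": 2129.15,
    "(1, 1, 2)(2, 1, 2)": 2129.67,
    "(0, 1, 2)(0, 1, 2)": 2130.83,
    "(0, 1, 2)(1, 1, 2)": 2133.13,
    "(1, 1, 2)(0, 1, 2)": 2133.58,
}

# Assumed selection rule: prefer the specification with the lowest AIC.
best_order = min(table3_aic, key=table3_aic.get)
```

Since ln n exceeds 2 for any series with more than seven observations, BIC penalizes each additional parameter more heavily than AIC, so the two criteria need not rank candidate models identically.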

Table 4. Fitting results of SARIMAX models with the different exogenous variables.

Models	AIC	BIC
SARIMAX with all the introduced exogenous variables	1975.70	2018.58
SARIMAX with the selected variables	1964.33	1995.30

Table 5. Test results for 24-month forecasts (2023–2024) by the tested methods.

Tested Methods	MAPE (%)	MAE	RMSE
SARIMAX with the selected variables	7.32	99,635	135,903
SARIMA	20.88	266,257	302,408
SARIMAX with all the variables	7.36	99,178	130,917
SARIMAX with the one variable	9.90	131,330	161,147
Prophet	84.10	1,064,223	1,088,519
Holt Method	24.99	332,416	376,055
Winters Method	34.98	453,002	486,392
LSTM	31.76	414,207	446,316
RNN	8.33	103,836	126,660


© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
