1. Introduction
The volume of U.S. citizens traveling abroad carries profound economic, political, and cultural implications, not only for the United States but also for destination countries worldwide. International travel demand directly influences the tourism and hospitality sector, shaping both inbound and outbound transportation flows [1]. For stakeholders such as government policy-makers, transportation authorities, service providers, and international marketers, the ability to understand volumes, shifts, and uncertainties in travel demand is essential for informed strategic decision-making. In an era defined by increasing global connectivity [2] and the rising contribution of international tourism to national and global GDP [3], accurate forecasting of outbound travel has become critical for maintaining market competitiveness and ensuring sustainable growth. Capturing the underlying structure of historical travel demand data often requires advanced statistical and computational tools capable of identifying latent patterns, which may be deterministic, stochastic, or a combination of both. Effectively modeling this inherent complexity is therefore essential for producing reliable forecasts, underscoring the need for modern modeling techniques that can uncover complex data patterns and inform future travel demand forecasting.
Despite its importance, forecasting international travel demand remains challenging. Demand patterns are highly sensitive to structural shifts, including economic downturns and geopolitical events, and global disruptions such as the COVID-19 pandemic have severely altered mobility and reshaped tourism dynamics [4,5,6,7,8]. Traditional time series models, although effective for modeling linear dependencies and seasonal cycles, often fall short in accommodating the non-linear, non-stationary, and multidimensional dynamics that characterize real-world travel behavior [9,10]. Their reliance on stationarity assumptions, limited handling of exogenous shocks, and inability to scale efficiently with high-frequency or high-volume data pose substantial risks of bias, particularly during periods of rapid change and uncertainty. Forecasting from non-stationary time series data is therefore challenging: it requires careful attention to the major patterns in the historical data and to how those patterns carry over to the unobserved future. It is fundamentally important to employ forecasting methods that can both capture historical data patterns and adequately reflect future uncertainty. Compared with modern machine learning (ML) approaches, traditional time series forecasting methods exhibit several limitations. Their inherently linear structure restricts their ability to model non-linear and complex temporal dependencies, and their heavy reliance on stationarity assumptions typically requires substantial data pre-processing. For instance, the Auto Regressive Integrated Moving Average (ARIMA) model incorporates external variables only in a limited manner, making it less suitable for high-dimensional data.
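As a concrete illustration of the stationarity requirement noted above, the following sketch (Python with NumPy; the series is synthetic and purely illustrative) shows how ARIMA's differencing step, the "I" in ARIMA(p, d, q), removes a deterministic trend:

```python
import numpy as np

def difference(series: np.ndarray, d: int = 1) -> np.ndarray:
    """Apply d rounds of first-differencing: the 'I' step in ARIMA(p, d, q)."""
    for _ in range(d):
        series = np.diff(series)
    return series

# A series with a linear trend is non-stationary in the mean.
# One round of differencing reduces it to a constant (stationary) series;
# a second round reduces it to zeros.
trend = 50.0 + 2.0 * np.arange(12)   # hypothetical monthly volumes
once = difference(trend, d=1)        # constant slope, 2.0 everywhere
twice = difference(trend, d=2)       # all zeros
```

Real travel demand series also carry stochastic and seasonal components, so in practice the differencing order is chosen with unit-root tests rather than by inspection.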
Recent advances in data science provide a promising pathway to overcome these limitations. ML and deep learning (DL) methods offer the ability to capture non-linear relationships among variables [11,12], uncover latent patterns through automated feature extraction [13,14], and adapt dynamically by modeling long-term temporal dependencies [15]. A growing body of research highlights the strong performance of algorithms such as Random Forest (RF) and Gradient Boosting, as well as recurrent neural networks including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), in improving forecasting accuracy in tourism domains [16,17,18,19]. This suggests the need to further investigate the applicability of these advanced ML methods to forecasting problems in travel research, in particular by systematically comparing the performance of different forecasting models on travel demand data. This study is motivated by the urgent need for resilient and flexible forecasting models capable of addressing the dynamic and interconnected nature of global travel demand [20,21]. By conducting a systematic comparison of traditional time series methods (e.g., ARIMA), classical ML models (e.g., Random Forest, Gradient Boosting), and advanced DL architectures (e.g., LSTM, GRU), this research contributes to identifying the most effective and adaptive forecasting strategies for travel demand. Unlike much of the existing literature, which tends to focus on inbound flows or domestic mobility, this study emphasizes outbound U.S. travel across global regions, a dimension that remains under-explored yet critical for understanding international tourism flows.
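To make such a side-by-side comparison concrete, the sketch below (Python with scikit-learn and NumPy) benchmarks a linear autoregressive baseline against two tree-based ensembles on a common hold-out window. The synthetic monthly series, 12-month lag window, and hyperparameters are illustrative assumptions, not the study's actual configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

def make_lagged(y, n_lags=12):
    """Turn a univariate series into a supervised (X, y) lag matrix."""
    X = np.column_stack([y[i : len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

rng = np.random.default_rng(42)
t = np.arange(240)  # 20 years of synthetic monthly observations
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, t.size)

X, target = make_lagged(y, n_lags=12)
split = len(target) - 24                      # hold out the last 24 months
X_tr, X_te = X[:split], X[split:]
y_tr, y_te = target[:split], target[split:]

models = {
    "linear (AR-style)": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse_by_model[name] = float(np.sqrt(np.mean((pred - y_te) ** 2)))
```

A design choice worth noting: because tree-based models cannot extrapolate beyond the range of their training targets, trending series like this one tend to favor the linear baseline unless the trend is removed or encoded as a feature.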
From a practical perspective, the findings of this study have important implications for transportation planners, tourism authorities, and policy-makers concerned with managing international mobility under uncertainty. Accurate forecasts of outbound travel demand can support infrastructure investment, airline capacity planning, marketing allocation, and policy formulation related to visa processing and international coordination. By providing region-specific forecasting insights, this study offers decision-makers actionable evidence to support data-informed planning in an increasingly volatile global travel environment. From a methodological standpoint, this study contributes a structured comparative framework for evaluating forecasting performance across heterogeneous regions using multiple model families. Rather than advocating a single universal approach, the analysis emphasizes how model suitability depends on regional data characteristics, including scale, volatility, and temporal structure. By systematically benchmarking statistical, ML, and DL models within a unified experimental design, this work advances empirical understanding of model selection in outbound travel demand forecasting and establishes a foundation for future extensions by incorporating additional data sources and hybrid modeling strategies.
2. Literature Review
The economic shocks and subsequent structural shifts in global mobility patterns have underscored the need for robust, adaptive forecasting approaches capable of addressing both short-term shocks and long-term transformations. In response, recent studies have increasingly turned to advanced time series models and ML methods to capture the complex, non-linear, and non-stationary dynamics inherent in time series data. For example, Bontempi et al. [22] document how tree-based ensembles and neural networks adapt to evolving data-generating processes without requiring strict parametric assumptions. Similarly, Lai et al. [23] demonstrate that DL architectures, including recurrent and convolutional neural networks, substantially outperform traditional models when time series exhibit non-linearity and non-stationarity.
A prominent research direction centers on the integration of DL architectures to improve forecasting accuracy. For example, Chen et al. [24] proposed a spatial–temporal transformer network that simultaneously models temporal evolution and spatial interactions, while Zhang et al. [25] introduced a hybrid BiLSTM–transformer framework to account for both short-term volatility and long-term structural patterns. Similarly, ensemble DL strategies, such as the bagging-based multivariate approach by Sun et al. [26], demonstrate how combining multiple learners can significantly enhance robustness and reduce predictive variance. In addition, Lim and Zohren [27] show that attention-based DL models can capture complex temporal dependencies and regime shifts that are difficult to model using classical approaches. These studies underscore a broader shift in the time series literature toward flexible, data-driven modeling frameworks that explicitly accommodate non-linearity, temporal dependence across multiple scales, and evolving data-generating mechanisms. This growing body of work highlights the potential of modern ML approaches not only to improve predictive accuracy but also to enhance model robustness, motivating their adoption in increasingly complex real-world time series forecasting.
Another stream of research emphasizes the fusion of multi-source data to enrich forecasting models. For instance, Lee [20] incorporated web search trends as exogenous variables into an SARIMAX framework, highlighting the utility of behavioral information as a leading indicator of demand shifts. In a complementary study, Colladon et al. [28] applied semantic and social network analysis to online travel forums, showing that unstructured digital footprints can provide early-warning signals of changes in travel demand behavior. These studies collectively highlight the inadequacy of traditional forecasting methods, which often fail to capture the richness and heterogeneity of contemporary travel demand dynamics. Addressing the challenge of structural breaks and crisis-induced volatility has also become a central focus. For example, Liu et al. [29] introduced the BayesBag method, integrating bootstrap aggregation with Bayesian inference to improve model stability under uncertainty. Hybrid CNN–LSTM approaches, such as the post-pandemic recovery forecasting for Vietnam in [30], further demonstrate the potential of adaptive architectures to outperform conventional models in turbulent contexts. These contributions underscore the importance of resilient methods capable of adjusting to abrupt disruptions.
More recently, advances in data augmentation and transformer-based architectures have further expanded methodological horizons. Diao et al. [31] leveraged spatio-temporal GANs to generate virtual samples for transformer models, effectively mitigating data sparsity issues. Similarly, Li et al. [32] proposed a hybrid framework integrating time series decomposition with a temporal fusion transformer optimized via Bayesian search, achieving superior accuracy and interpretability. Expanding on this line, Yi et al. [33] integrated calendar-based encodings into a transformer encoder–decoder architecture, enhancing both interpretability and predictive performance. Together, these innovations reflect a growing trend toward combining generative, predictive, and interpretable frameworks to address both data limitations and practical decision-making needs.
As evidenced by [24,25,26], recent literature shows rapid adoption of advanced ML and DL methods for time series forecasting, particularly in the context of tourism demand prediction. Yet, two critical gaps remain: (1) most studies concentrate on inbound tourism or regional-level demand, leaving outbound travel demand comparatively underexplored; and (2) although ensemble and hybrid models have shown strong potential, their application to outbound U.S. travel demand forecasting is still limited. By addressing these gaps, the present study advances the literature by systematically evaluating a range of forecasting approaches for U.S. outbound travel, providing both methodological contributions and actionable insights for policy-makers, transportation authorities, and the tourism industry.
5. Results
This section presents and analyzes the empirical results.
Table 6 summarizes model performance across all evaluation metrics. Overall, the LSTM model achieves the highest predictive accuracy, as evidenced by its lowest RMSE, MAE, and normalized RMSE values. This suggests a strong capability in capturing complex, non-linear temporal dependencies. However, these gains in accuracy come with increased model complexity (a large number of parameters to train), highlighting a trade-off between fit and parsimony that is particularly relevant in contexts where interpretability and computational efficiency are critical.
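The three metrics reported in Table 6 can be computed as follows. Note that the normalization convention for RMSE is an assumption here (dividing by the mean of the actuals), since several conventions exist, and the passenger figures are purely illustrative:

```python
import numpy as np

def rmse(actual, pred):
    """Root mean squared error."""
    a, p = np.asarray(actual, dtype=float), np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def mae(actual, pred):
    """Mean absolute error."""
    a, p = np.asarray(actual, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(a - p)))

def nrmse(actual, pred):
    """Normalized RMSE; dividing by the mean of the actuals is one common
    convention (dividing by the range of the actuals is another)."""
    return rmse(actual, pred) / float(np.mean(actual))

actual = np.array([120_000, 135_000, 150_000, 142_000])  # illustrative volumes
pred = np.array([118_000, 140_000, 147_000, 145_000])
```

Normalized RMSE is what makes cross-region comparisons meaningful here, since raw RMSE scales with each region's passenger volume.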
Notably, simpler models such as ARIMA and Random Forest demonstrate competitive accuracy while maintaining lower complexity. We also note that modeling the aggregated data without separating regions would introduce substantial heterogeneity, as evidenced by the markedly different mean levels of outbound passenger volumes across regions. For this reason, we further analyze model performance across eight regions: Europe, the Caribbean, Asia, South America, Central America, Oceania, the Middle East, and Africa. Given the heterogeneity in travel demand patterns driven by seasonal, economic, cultural, and geopolitical factors, this regional breakdown offers insight into the adaptability of each model.
Model performance varied considerably across regions, reflecting differences in data scale, variability, and underlying travel dynamics. Overall, the comparative analysis highlights the importance of aligning model complexity with data characteristics and operational constraints.
For Africa, with results reported in Table 7, the ARIMA(1, 1, 1) model achieved the best performance across all key error metrics, demonstrating a strong capacity to capture the underlying linear structure of passenger flows. Both XGBoost and Random Forest showed moderate results, with the latter slightly outperforming the former. Deep learning models, particularly LSTM, performed poorly, as evidenced by high RMSE and MAE values. This suggests overfitting due to limited data volume and variability, emphasizing the limitations of data-intensive models in low-sample contexts. While ARIMA offered simplicity and accuracy, it lacked the flexibility to account for non-linear effects, whereas tree-based models provided adaptability at the cost of computational efficiency.
In contrast, the results for Asia, shown in Table 8, exhibited distinct patterns favoring deep learning approaches. Both LSTM and GRU achieved the best predictive accuracy, benefiting from their ability to capture non-linear temporal dependencies and long-term relationships inherent in the data. ARIMA remained a robust yet simpler alternative, performing adequately for relatively stable time series but failing to respond effectively to non-linear shifts. XGBoost and Random Forest produced intermediate results, suggesting that their performance could improve with feature augmentation or hyperparameter tuning. These results illustrate how model suitability depends on the complexity and temporal structure of regional data.
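The gating mechanism that lets GRUs retain long-term dependencies can be illustrated with a single-cell forward pass in NumPy. This is a didactic sketch with random, untrained weights (the PyTorch-style gate convention is assumed), not the study's trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde.
    W, U, b hold the three gates' parameters, keyed 'z', 'r', 'h'."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])      # how much old state to keep
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])      # how much old state feeds the candidate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])
    return (1 - z) * h_tilde + z * h                   # convex mix of old and new state

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_in)) * 0.1 for k in "zrh"}
U = {k: rng.normal(size=(n_hid, n_hid)) * 0.1 for k in "zrh"}
b = {k: np.zeros(n_hid) for k in "zrh"}

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):    # run five time steps
    h = gru_step(x, h, W, U, b)
```

Because the update gate z can stay close to 1, the hidden state can pass through many steps nearly unchanged, which is why GRUs handle long-term dependencies better than a plain recurrent layer.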
For the Caribbean, shown in Table 9, Random Forest delivered the lowest RMSE and normalized RMSE, confirming its superior ability to model non-linear interactions without substantial overfitting. Deep learning models such as LSTM and GRU provided moderate performance. ARIMA underperformed, reflecting its inability to represent pronounced seasonality and volatility in regional travel patterns. These findings highlight the efficacy of ensemble tree-based methods in small- to medium-scale datasets where non-linearity is present but data are insufficient for deep architectures.
In Europe, where the results are given in Table 10, LSTM achieved the lowest RMSE (219,030), while GRU obtained the lowest normalized RMSE (0.13), indicating that both models effectively captured complex seasonal and trend components in the data. However, their computational demands and tuning requirements limit their scalability for operational use. ARIMA, by comparison, underperformed due to its rigid linear assumptions, underscoring its limited adaptability to intricate time series dynamics. The results suggest that while deep learning models excel in high-variability environments, practical applications must balance predictive power against resource and interpretability considerations.
For the Middle East, XGBoost emerged as the top-performing model, yielding the lowest RMSE, normalized RMSE, and MAE values, as shown in Table 11. Its gradient boosting framework effectively modeled non-linear dependencies and complex regional variations. Random Forest performed moderately well, outperforming ARIMA but trailing XGBoost. LSTM, on the other hand, recorded the weakest performance, likely due to overfitting in a relatively small dataset. While XGBoost achieved the highest accuracy, its computational intensity and limited interpretability suggest that marginal gains over simpler alternatives should be carefully evaluated in applied settings.
In Oceania, as shown in Table 12, ARIMA achieved the lowest RMSE and MAE, demonstrating strong predictive accuracy and reliable tracking of observed data trends. Random Forest also performed competitively, while GRU and XGBoost showed higher errors and lower stability. Although ARIMA’s linear structure effectively captured the dominant temporal trends, its inability to model complex dependencies may restrict its usefulness in more data-rich or volatile contexts. Nonetheless, for small or stable datasets, ARIMA remains a parsimonious and effective forecasting tool.
Finally, the comparative results for South America are displayed in Table 13. XGBoost outperformed all other models, achieving the lowest RMSE (23,702.39), normalized RMSE (0.17), and MAE (18,465.63). ARIMA provided a consistent baseline with clear interpretability but lagged in predictive precision. Both LSTM and GRU underperformed, likely due to limited temporal depth and data complexity. These outcomes reaffirm the balance required between accuracy, model complexity, and interpretability when selecting predictive approaches for regional forecasting.
Taken together, the results reveal that model performance is context-dependent. ARIMA remains effective for stable or low-variability series, tree-based ensemble methods (XGBoost, Random Forest) excel in moderately complex and non-linear contexts, and deep learning models (LSTM, GRU) are most suitable when sufficient data are available to support their representational capacity. The comparative findings underscore the necessity of tailoring model selection to regional data characteristics, ensuring that predictive accuracy is achieved without sacrificing scalability or interpretability.
Figure 3 and Figure 4 illustrate substantial regional heterogeneity in U.S. outbound travel demand and corresponding differences in model performance. Across all regions, outbound passenger volumes exhibit distinct magnitudes, volatility levels, and temporal patterns, underscoring the importance of region-specific analysis. No single forecasting model consistently outperforms others across all regions, highlighting the limitations of a uniform modeling strategy. In regions characterized by higher variability and more complex demand dynamics, such as Europe and Asia, machine learning and deep learning models more effectively track observed fluctuations, while in regions with lower passenger volumes or more stable patterns, simpler models such as ARIMA remain competitive and, in some cases, superior.
The comparative results further demonstrate that model effectiveness depends critically on data richness and structural complexity. Deep learning architectures, particularly LSTM and GRU, tend to provide improved predictive accuracy in regions with sufficient data and pronounced non-linear temporal dependencies, whereas ensemble tree-based methods offer strong performance in moderately complex settings. Conversely, in data-sparse or relatively stable regions, traditional time series models yield robust and interpretable forecasts with lower computational cost. Together, these findings emphasize that outbound travel demand forecasting benefits from a flexible, region-adaptive modeling framework that balances predictive accuracy, computational efficiency, and interpretability rather than relying on a single universal approach.
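At its simplest, such a region-adaptive framework reduces to selecting, independently for each region, the candidate model with the lowest validation error. The sketch below uses placeholder error values chosen to loosely mirror the reported regional winners; they are not the study's actual figures:

```python
# Hypothetical normalized validation RMSEs per region. In practice these
# would come from backtesting each candidate on that region's hold-out window.
validation_rmse = {
    "Europe":      {"ARIMA": 0.21, "RandomForest": 0.17, "LSTM": 0.13},
    "Africa":      {"ARIMA": 0.15, "RandomForest": 0.18, "LSTM": 0.34},
    "MiddleEast":  {"ARIMA": 0.25, "XGBoost": 0.16, "LSTM": 0.31},
}

def select_per_region(scores):
    """Pick the lowest-error model independently for each region."""
    return {region: min(models, key=models.get)
            for region, models in scores.items()}

chosen = select_per_region(validation_rmse)
```

Using normalized rather than raw errors for the selection step keeps the comparison fair across regions with very different passenger volumes.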
6. Discussion
The empirical results reveal substantial heterogeneity in model performance across regions, underscoring that forecasting accuracy is strongly conditioned on data scale, variability, and temporal structure. Ensemble tree-based models, such as XGBoost and Random Forest, consistently performed well in regions characterized by moderate data volume and non-linear seasonal patterns, while deep learning models exhibited advantages primarily in high-traffic regions with richer temporal dynamics. These findings indicate that no single modeling approach dominates across all contexts, highlighting the importance of aligning model complexity with regional data characteristics.
Deep learning architectures, including LSTM and GRU, demonstrated improved predictive accuracy in regions with sufficient historical depth, where long-term temporal dependencies and evolving demand patterns are more pronounced. Conversely, their weaker performance in data-sparse regions suggests sensitivity to sample size and an increased risk of overfitting, revealing the practical limits of highly parameterized models in constrained settings. Tree-based ensemble methods offered a more stable balance between flexibility and robustness in several regions, making them strong alternatives when deep architectures are ill-suited to the available data.
Feature design also played a critical role in shaping model performance. The inclusion of temporal indicators and lagged passenger volumes enhanced forecast accuracy by enabling models to better capture seasonal persistence and trend evolution. These results highlight that careful data organization and feature engineering remain essential, even when advanced learning algorithms are employed. From an applied perspective, the region-specific insights derived from this comparative analysis can support more targeted decision-making in capacity planning, marketing allocation, and policy design. By demonstrating how forecasting performance varies across regional contexts, this study provides practical guidance for selecting appropriate modeling strategies in real-world travel demand applications.
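The feature design described above, calendar indicators plus lagged passenger volumes, can be sketched with pandas. The column names, lag set, and synthetic data are illustrative assumptions, not the study's exact specification:

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame, lags=(1, 2, 12)) -> pd.DataFrame:
    """Add calendar indicators and lagged passenger volumes to a monthly frame.
    Expects a DatetimeIndex and a 'passengers' column (names are assumptions)."""
    out = df.copy()
    out["month"] = out.index.month                 # seasonal indicator
    out["year"] = out.index.year                   # trend indicator
    for lag in lags:
        out[f"lag_{lag}"] = out["passengers"].shift(lag)
    return out.dropna()                            # drop rows lacking full lag history

idx = pd.date_range("2015-01-01", periods=36, freq="MS")
df = pd.DataFrame({"passengers": np.arange(36) * 1000 + 50_000}, index=idx)
features = build_features(df)
```

The 12-month lag gives each row access to the same month a year earlier, which is what lets tree-based learners pick up seasonal persistence without an explicit seasonal model.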
7. Conclusions and Future Work
This study demonstrates that the suitability of forecasting models for outbound travel varies significantly across regions, depending largely on the volume of data and the complexity of underlying travel patterns. In data-rich, high-variability regions such as Europe and Asia, deep learning models (LSTM, GRU) delivered the highest predictive accuracy, while ensemble tree-based methods, particularly XGBoost and Random Forest, led in the Caribbean, the Middle East, and South America, effectively capturing seasonal variation and non-linear dynamics. In contrast, for regions with lower passenger volumes or more stable patterns, such as Africa and Oceania, simpler models like ARIMA proved more reliable, offering stable forecasts while mitigating the risk of overfitting in data-sparse environments.
The results emphasize the importance of region-specific model selection, where the balance between model complexity and data availability must be carefully managed. While sophisticated models are advantageous in data-rich contexts, simpler, more interpretable models offer better generalizability in regions with limited data. This trade-off between bias and variance is critical for ensuring robust and practical forecasting outcomes.
The findings of this study have direct implications for industry practitioners and policymakers. For airline revenue management, the demonstrated strength of adaptive machine learning models enables more responsive capacity planning, allowing carriers to optimize seat inventory and route frequency in anticipation of demand fluctuations across destination regions. Airport authorities and ground handlers can leverage regional forecasts to allocate staffing and terminal resources more efficiently, particularly during seasonal peaks.
From a policy perspective, destination countries can utilize these forecasting frameworks to anticipate U.S. tourist inflows, informing visa processing capacity, border control staffing, and tourism infrastructure investments. In addition, national tourism organizations can apply region-specific forecasts to allocate marketing budgets strategically, targeting promotional efforts toward periods and regions where demand elasticity is highest. At the macroeconomic level, improved forecasting accuracy supports more reliable projections of tourism-related foreign exchange earnings, informing fiscal planning in tourism-dependent economies.
Looking ahead, future research can build on these findings in several ways. First, incorporating additional features, such as economic indicators, political events, and environmental factors like weather patterns, could significantly enhance predictive performance, especially in regions characterized by high variability. Second, exploring ensemble learning techniques, such as stacking or blending, offers the potential to combine the strengths of various models. Such hybrid approaches could adaptively apply simpler models in data-scarce settings and more complex algorithms in data-rich contexts to yield consistently strong results.
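The stacking idea outlined above can be sketched with scikit-learn's StackingRegressor, where a meta-learner blends out-of-fold predictions from heterogeneous base models. The base learners, meta-learner, and synthetic data below are illustrative choices, not a prescription from this study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
# Synthetic lag-style features and target; real inputs would be the lagged
# volumes and calendar indicators discussed earlier.
X = rng.normal(size=(200, 6))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0, -1.0]) + rng.normal(0, 0.1, 200)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(),       # meta-learner weights the base predictions
    cv=5,                          # out-of-fold predictions avoid target leakage
)
stack.fit(X[:160], y[:160])
pred = stack.predict(X[160:])
holdout_rmse = float(np.sqrt(np.mean((pred - y[160:]) ** 2)))
```

Because the meta-learner can learn to down-weight whichever base model performs worse on a given dataset, such a blend could in principle apply simpler models in data-scarce regions and more complex ones in data-rich regions, as envisioned above.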