1. Introduction
Tourism serves as a cornerstone of the Bulgarian economy, contributing 10.8% to the country’s total GDP in the pre-pandemic year of 2019 and reaching these levels again by the end of 2024 [
1]. Beyond its economic significance, tourism generates employment, promotes entrepreneurship, protects heritage and cultural values, and facilitates cross-cultural exchange [
2]. The sector’s substantial contribution to Bulgaria’s economic development underscores the critical importance of accurate tourism demand (TD) forecasting for strategic planning and operational decision-making.
TD forecasting represents a fundamental component of strategic planning for destinations and service providers, particularly given the sector’s heightened sensitivity to macroeconomic fluctuations and external shocks. Accurate prediction of tourism flows enables stakeholders to optimize resource allocation, manage capacity constraints, develop pricing strategies, and formulate resilient recovery strategies during crisis periods. TD is typically quantified through indicators such as arrivals, overnight stays (bed-nights), visitor counts, international tourism receipts, and expenditure on tourism imports, with indicator selection contingent upon data availability and geographical aggregation levels [
3]. Moreover, TD forecasting, particularly through the lens of overnight stays, has garnered significant attention in recent years due to its critical role in strategic planning and resource allocation within the tourism and hospitality sectors. Overnight stays serve as a tangible indicator of tourist engagement and economic impact, making them a focal point for predictive modelling [
3,
4,
5,
6,
7,
8,
9,
10,
11].
The proliferation of data availability and computational advances has catalysed a paradigm shift from classical statistical models to hybrid machine learning (ML) and deep machine learning (DML) architectures that integrate artificial intelligence with traditional econometric approaches. This evolution is particularly relevant for understanding complex relationships between macroeconomic indicators such as GDP and Consumer Price Index (CPI) and TD patterns, especially during disruptive internal or external tourism system events. The adoption of AI-driven methods, including Facebook Prophet and specialized Python libraries for time-series forecasting, signifies a shift toward more sophisticated methodologies that can enhance forecasting accuracy and facilitate informed decision-making in tourism planning and management.
However, significant research gaps persist in the current literature. First, time-series forecasting research on Bulgarian tourism using ML and DML models is severely limited, with insufficient empirical studies addressing this specific market context. Second, most TD forecasting research focuses on short-term predictions due to the high volatility of tourism time-series data and its susceptibility to unpredictable internal and external shocks, leaving long-term forecasting approaches underexplored. Third, there is a notable lack of multi-source data integration in TD forecasting, as most prior studies have considered single big data sources, with few integrating multiple data sources within unified forecasting models. The research landscape for Bulgaria-specific TD forecasting, particularly using advanced ML and DML approaches, remains relatively underdeveloped compared to other European destinations, presenting significant opportunities for scientific advancement in regional forecasting, multi-source big data integration, and crisis resilience modelling.
The motivation for this study stems from the demonstrated inadequacy of traditional forecasting models during periods of structural breaks and unprecedented volatility, as evidenced during the COVID-19 pandemic. These limitations necessitate the development of more adaptive and robust methodological frameworks that can effectively capture complex patterns and respond to crisis scenarios. Furthermore, the integration of data science frameworks with tourism research presents an opportunity to advance both theoretical understanding and practical applications in the field.
Therefore, this research aims to address these gaps through the following specific objectives: (1) to evaluate the comparative performance of ensemble ML and DML methodologies for Bulgarian TD forecasting; (2) to assess the effectiveness of integrating macroeconomic indicators (GDP and CPI in our case) with COVID-19 data for capturing structural breaks in tourism patterns; (3) to determine whether sophisticated deep learning architectures provide superior forecasting accuracy compared to traditional ML approaches in the context of seasonal tourism data; and (4) to develop robust forecasting frameworks that can generalize well during crisis periods and provide practical insights for all tourism stakeholders.
Thus, the research investigates the comparative effectiveness of ensemble machine learning versus deep learning methodologies for Bulgarian TD forecasting, with particular emphasis on evaluating the integration of macroeconomic indicators and multi-source data for enhanced predictive accuracy during periods of economic volatility and structural disruptions. We aim to determine empirically the optimal modelling approach for capturing seasonal tourism patterns while providing robust forecasting frameworks that can effectively generalize during crisis periods and deliver actionable insights for tourism stakeholders and policymakers. Accordingly, the paper consists of a comprehensive Literature Review on TD forecasting approaches and a Materials and Methods section in which the methodology and research methods are elaborated, followed by the estimations presented in the Results and Discussion sections. The Conclusion summarizes the significance of the research together with its practical implications, limitations, reproducibility, and prospects for further research.
2. Literature Review
TD forecasting has evolved significantly over several decades, driven by the stochastic and non-linear nature of tourism flows that present persistent challenges for researchers and practitioners. While traditional time-series regression methods such as Autoregressive Integrated Moving Average (ARIMA) and Seasonal ARIMA (SARIMA) models remain popular for their simplicity and interpretability [
12,
13], recent advances have demonstrated the limitations of these approaches in capturing complex seasonality and non-linear patterns inherent in tourism data.
Prior to the 2020s, research increasingly focused on advanced econometric methodologies, including cointegration analysis, error correction models (ECM), vector autoregressive (VAR) processes, and time-varying parameter (TVP) approaches [
13]. However, these models often prove inadequate during periods of economic volatility where non-linear relationships become pronounced. Consequently, the beginning of the 2020s witnessed growing interest in artificial neural network (ANN) models, SARIMAX, and asymmetric GARCH specifications, with the Glosten–Jagannathan–Runkle GARCH (GJR-GARCH) model demonstrating superior out-of-sample forecasting accuracy [
3,
14].
The foundation established by ARIMA and SARIMA models has served as performance benchmarks in subsequent comparative studies. Yu et al. [
15] proposed the SA-D model, combining SARIMA with dendritic neural networks to address nonlinear residuals, demonstrating superior performance compared to standalone SARIMA models. This hybrid approach represents early recognition that purely statistical models may be insufficient for capturing complex relationships within TD data. Similarly, Silva and Alonso [
16] confirmed the persistent effectiveness of neural network approaches when analysing overnight stays in Portugal, highlighting the practical challenge of balancing interpretability and accuracy in operational contexts [
17].
Scotti et al. [
18] examined tourist behaviour segmentation using mobile phone network data in Lombardy, Italy, revealing distinct economic drivers for same-day visitors versus overnight tourists. Their analysis demonstrated that while accommodation capacity and cultural assets primarily drove overnight stays, transportation infrastructure and festival events significantly increased same-day visit attractiveness. This segmentation approach has direct practical implications for destination marketing organizations seeking to optimize their resource allocation across different visitor segments with varying economic sensitivities.
Earlier TD forecasting relied heavily on econometric models that explicitly incorporated macroeconomic determinants, recognizing tourism as a luxury good with high income elasticity. While traditional models such as autoregressive distributed lag (ARDL) and error correction models (ECM) remain effective for capturing long-run relationships between TD determinants including GDP, relative prices, and exchange rates [
17,
19], recent research has demonstrated the context-sensitive nature of these relationships.
Dynamic panel data models, particularly those employing the Generalized Method of Moments (GMM), have proven effective in capturing regional heterogeneity while controlling for endogeneity. Serra et al. [
19] applied such models to Portuguese tourism data, concluding that income elasticities suggest tourism’s luxury good status with heterogeneous regional distribution. Tica and Kožić [
20] employed multiple linear regression models for Croatian inbound tourism, confirming that macroeconomic variables such as GDP remain key determinants, while highlighting limitations of linear approaches during economic volatility. On the other hand, Sofianos et al. [
21] analysed financial forecasting in the U.S. tourism industry using supervised and unsupervised ML methods, highlighting the superior performance of neural networks in predicting consumer spending and market fluctuations compared to traditional econometric approaches.
Further, Crouch et al. [
22] demonstrated that demand-side behaviour in international tourism exhibits significant regional and income-level variations, with elasticity values typically ranging between 0.5 and 2.0 for international travel. These findings confirm that while GDP remains a robust predictor, its context-sensitive nature necessitates adaptive forecasting approaches that account for demographic and economic heterogeneity across market segments. The practical implication is that destinations must tailor their forecasting models to specific source markets, recognizing that GDP impact varies significantly across different economic contexts [
23,
24] and time horizons [
25,
26].
The advancement of computational power has catalysed adoption of machine learning methodologies in tourism forecasting. While neural network-based models, particularly Recurrent Neural Networks (RNNs) and their variants, demonstrate superior performance in capturing non-linear relationships between economic indicators and TD, recent research has shown that ensemble approaches often provide more robust solutions than individual advanced models. Salamanis et al. [
27] applied Long Short-Term Memory (LSTM) networks to Greek hotel booking data, demonstrating enhanced predictive strength when incorporating weather data as exogenous variables alongside traditional economic indicators. However, the practical implementation challenges of deep learning architectures have led researchers to explore ensemble methodologies that combine theoretical rigor with computational flexibility.
Ouassou and Taya [
28] conducted comparative analysis of ARIMA, Support Vector Regression (SVR), XGBoost, and LSTM models for Moroccan regional tourism forecasting, demonstrating that ensemble models integrating conventional statistical and AI-based approaches achieved superior performance compared to individual models. This finding aligns with the comprehensive review by Song et al. [
17] of 211 publications from 1968 to 2018, which concluded that no single model consistently outperforms others across different contexts, emphasizing the need for flexible, context-sensitive forecasting approaches.
Zheng and Zhang [
29] developed a hybrid grey model-LSTM (GM-LSTM) approach for Chinese tourism forecasting, achieving a mean absolute percentage error (MAPE) of 11.88% while effectively addressing small sample limitations. The practical significance of ensemble approaches lies in their ability to perform well with limited historical data, a common challenge in emerging tourism destinations or when modelling new market segments.
While deep learning architectures have gained prominence in tourism forecasting, recent studies reveal mixed results regarding their superiority over traditional machine learning approaches. Yu and Chen [
30] developed a Stacked Autoencoder LSTM (SAE-LSTM) architecture leveraging unsupervised pretraining, highlighting significant improvements over standard LSTM models through autoencoder-based feature extraction. Hsieh [
31] validated the effectiveness of LSTM, Bi-LSTM, and Gated Recurrent Unit (GRU) networks in modelling Taiwanese TD during crisis periods such as SARS and COVID-19.
However, recent research suggests that sophisticated deep learning architectures may not inherently provide superior forecasting accuracy for seasonal tourism data. Kim et al. [
32] critiqued standard Transformer-based models in time series forecasting, arguing that self-attention mechanisms may be suboptimal for temporal data due to their permutation-invariant structure. Their proposed Cross-Attention-only Time Series Transformer (CATS) achieved improved accuracy with reduced parameter complexity, suggesting that architectural refinements rather than increased complexity may be more beneficial. Moreover, innovative approaches such as the inverted transformer model have been applied to daily TD forecasting, using self-attention mechanisms to capture complex patterns in daily tourist volumes, including overnight visitors [
33,
34].
Advanced deep learning approaches including Graph Neural Networks (GNNs) and Transformer variants have been employed to model complex tourism dynamics while maintaining integration with macroeconomic variables. Fang et al. [
35] developed a graph-based deep learning model incorporating SHAP interpretability analysis, though the practical need for interpretable models in policy contexts remains a significant consideration.
Evidently, the COVID-19 pandemic has served as a critical case study for stress-testing forecasting models and understanding limitations of traditional approaches during structural economic disruption. While traditional econometric models remain theoretically sound, recent research has shown their inadequacy during periods of structural change, necessitating more flexible approaches that can adapt to evolving economic relationships.
Gunter et al. [
36] employed panel Fully Modified Ordinary Least Squares (FMOLS) to estimate EU outbound tourism expenditure under baseline and downside scenarios, demonstrating clear correlation between GDP losses and tourism sector contractions. Wu et al. [
37] introduced probabilistic scenario forecasting using Time-Varying Parameter Panel Vector Autoregressive (TVP-PVAR) models, providing valuable decision-support tools for policymakers operating under uncertainty conditions.
Du et al. [
38] addressed temporal distribution shift problems by proposing AdaRNN, an adaptive framework proving particularly valuable for post-pandemic forecasting where historical economic relationships may no longer be stable. This adaptability proves crucial for practical applications where forecasting models must continue performing effectively despite fundamental changes in the underlying economic environment.
Recent research has increasingly emphasized incorporating external data sources including web search behaviour, social media metrics, and real-time economic indicators to capture latent tourist behaviour patterns. While traditional approaches focus on single data sources, recent advances have shown significant improvements through multi-source integration approaches.
Lee [
39] introduced SARIMAX models enhanced with Google Trends data for Singapore visitor forecasting, achieving superior accuracy over univariate time-series models with MAPE of 7.32%. Jassim et al. [
40] underscored the critical value of multi-source data integration, advocating for combining structured economic data with unstructured behavioural data using advanced analytics techniques [
14].
Rashad [
41] demonstrated enhanced forecasting precision in post-COVID-19 recovery periods through integrating macroeconomic indicators with web search data into ARIMAX models. Lu et al. [
23] proposed Improved Attention-based Gated Recurrent Unit (IA-GRU) models effectively integrating web search indices and climate comfort indicators with traditional economic variables, demonstrating attention mechanisms’ value in identifying relevant economic predictors.
Recent innovations in forecasting methodology include the use of probabilistic forecast reconciliation, which ensures internal consistency across multiple time series dimensions while maintaining coherence with macroeconomic constraints [
42]. Girolimetto et al. [
43] introduced a cross-temporal reconciliation framework to handle both temporal and cross-sectional constraints, significantly improving the coherence of tourism forecasts when applied to Australian tourism data. This methodological advancement addresses a practical challenge in operational forecasting where multiple forecasts (by region, market segment, or time horizon) must be internally consistent and sum to meaningful totals.
Despite significant methodological advances, several research areas remain underexplored. The literature reveals limited application of scenario-based and probabilistic forecasting methods in practical tourism contexts, despite demonstrated value during crisis periods [
30]. Temporal distribution shifts caused by pandemic disruptions continue to challenge traditional forecasting models, highlighting the need for approaches that can adapt to evolving economic relationships [25,36,44].
The comprehensive review by Aamer et al. [
45] revealed that neural networks (27%), artificial neural networks (22%), and support vector machines (10%) emerged as the most commonly employed ML algorithms across sectors including tourism. However, current research suggests that ensemble methodologies combining multiple approaches may provide more robust solutions than individual sophisticated architectures, particularly for seasonal tourism data with integrated economic indicators like GDP and CPI.
Consequently, neural network and deep learning models are gaining momentum over Support Vector Machines and other ML models such as Random Forests and Gradient Boosting Machines [
5]. The integration of such forecasting frameworks therefore requires simultaneous consideration of how economic indicators like GDP and CPI, alongside other economic and social impact metrics, can provide a more holistic foundation for tourism planning and policy development. These advancements underscore a paradigm shift towards more sophisticated, data-driven forecasting methods that can adapt to the dynamic nature of TD. By leveraging machine learning and big data analytics, stakeholders can achieve more accurate and timely insights, facilitating better decision-making in tourism management and policy development [
17,
46,
47].
Specifically, regarding Bulgarian tourism forecasting, the research landscape remains limited compared to other European destinations. Most prior studies have considered single big data sources, with few integrating multiple sources within unified forecasting models. This presents opportunities for advancing both theoretical understanding and practical applications through ensemble methodologies that can effectively capture complex patterns while maintaining robustness during crisis periods. Thus, the present study extends the literature by proposing a conceptual framework that integrates three data sources (historical TD series, macroeconomic variables, and COVID-19 case data) and applies ML and DML to TD forecasting, enhancing the ability to capture complex interactions between GDP, CPI, and other non-macroeconomic variables.
3. Materials and Methods
Based on the TD forecasting literature reviewed above, this article applies a data science methodology utilizing AI-driven time series forecasting methods to model total overnight stays in Bulgaria over the period 2005–2024. The research integrates Bulgarian overnight stay data from the National Statistical Institute (the target variable y) with external regressors: economic indicators, namely Bulgarian GDP [48,49] and CPI [50], and COVID-19 case data [51], the latter included to capture pandemic-related structural breaks in tourism patterns.
The methodology employs ensemble machine and deep learning approaches, combining Prophet with external regressors, Ridge regression with feature engineering, and gradient boosting models, with the ensemble weights determined by the inverse mean absolute error (MAE) of each model.
The research design follows a structured four-phase approach: (1) data collection and integration, (2) feature engineering and preprocessing, (3) model development and training, and (4) comparative evaluation with statistical testing. This design enables comprehensive assessment of model performance while maintaining scientific rigor and reproducibility.
3.1. Data Collection and Integration
The primary dataset comprises monthly overnight stays in Bulgaria, sourced from the National Statistical Institute of Bulgaria, covering the period from January 2005 to December 2024 (n = 240 observations) [52]. This dataset represents the dependent variable (y) and serves as the primary indicator of tourism demand, chosen for its comprehensive coverage and official reliability. Overnight stays were selected as the primary metric due to their established use in tourism research as tangible indicators of tourist engagement and economic impact.
CPI data were integrated as the primary macroeconomic indicator, utilizing Bulgaria’s annual inflation rates from 2005 to 2024. The CPI data were sourced from Bulgaria’s National Statistical Institute publications and converted to monthly frequency by linear interpolation. Multiple lag structures (CPI_lag1, CPI_lag2, CPI_lag3) were created to capture delayed economic effects on tourism demand, along with a three-month rolling average (CPI_avg3m) to smooth short-term fluctuations.
COVID-19 data were obtained from the Our World in Data repository, specifically total confirmed cases per million people for Bulgaria. The data were aggregated from daily to monthly frequency by extracting the last reported value for each month, ensuring consistency with the temporal structure of the tourism data. COVID-19 data are available from 2020 onwards, with pre-pandemic periods assigned zero values to reflect the absence of pandemic effects.
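The alignment steps described above can be illustrated with a minimal pandas sketch. It is not the authors' code: the numeric values are invented toy data, the column name covid_per_million is a hypothetical label, and carrying the last cumulative COVID-19 value forward is an assumption made here for months without a report. Only the feature names CPI_lag1 to CPI_lag3 and CPI_avg3m come from the text.

```python
import pandas as pd

# Toy annual CPI inflation rates (illustrative values, NOT the study's data)
annual_cpi = pd.Series(
    [2.0, 4.0, 3.0],
    index=pd.to_datetime(["2020-01-01", "2021-01-01", "2022-01-01"]),
)

# Monthly grid that plays the role of the tourism series' time index
months = pd.date_range("2020-01-01", "2022-12-01", freq="MS")

# Annual -> monthly: place annual values on the grid, interpolate linearly,
# then forward/backward fill so every month has a value
cpi = annual_cpi.reindex(months).interpolate(method="linear").ffill().bfill()

features = pd.DataFrame({"CPI": cpi})
for lag in (1, 2, 3):
    # Lagged CPI to capture delayed economic effects
    features[f"CPI_lag{lag}"] = features["CPI"].shift(lag)
# Three-month rolling average to smooth short-term fluctuations
features["CPI_avg3m"] = features["CPI"].rolling(window=3).mean()

# Toy daily cumulative COVID-19 cases per million (illustrative values)
daily_covid = pd.Series(
    [10.0, 20.0, 35.0, 50.0],
    index=pd.to_datetime(["2020-03-30", "2020-03-31", "2020-04-15", "2020-04-30"]),
)

# Daily -> monthly: keep the last reported value of each month
monthly_covid = daily_covid.groupby(daily_covid.index.to_period("M")).last()
monthly_covid.index = monthly_covid.index.to_timestamp()  # month-start stamps

# Align to the monthly grid: carry the cumulative total forward (assumption)
# and assign zero to months before the first report
covid = monthly_covid.reindex(months).ffill().fillna(0.0)

# Left join keeps every month of the base (tourism) index
features = features.join(covid.rename("covid_per_million"), how="left")
```

The first rows of the lag and rolling-average columns are empty by construction; how those rows were treated in the study is not stated in this section.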
3.2. Feature Engineering and Preprocessing
The data integration process involved temporal alignment of multiple datasets with different frequencies and coverage periods. Tourism data (monthly) served as the base temporal framework, with COVID-19 data (daily) aggregated to monthly frequency and CPI data (annual) interpolated to monthly frequency. The integration employed left-join methodology to preserve all tourism observations while incorporating available external indicators.
Missing values were addressed through a systematic approach: (1) COVID-19 cases for pre-2020 periods were assigned zero values, reflecting the historical absence of the pandemic; (2) CPI values were forward-filled and backward-filled to ensure complete coverage; (3) any remaining missing values in predictor variables were handled through carry-forward imputation methods to maintain temporal continuity.
Comprehensive feature engineering was implemented to enhance model performance and capture temporal patterns. Time-based features included month (1–12), year (continuous), day of year (1–365), and quarters (1–4) to capture seasonal and trend components. The COVID-19 variable was scaled by a factor of 20,000 to align with the magnitude of tourism data, improving numerical stability during model training.
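As an illustration of the time-based features listed above, the following sketch derives month, year, day of year, and quarter from a date using only the Python standard library. The function name time_features and the dictionary keys are hypothetical; the study's own implementation is not shown in this section.

```python
from datetime import date


def time_features(d):
    """Return calendar features for a single observation date."""
    return {
        "month": d.month,                      # 1-12
        "year": d.year,                        # trend component
        "day_of_year": d.timetuple().tm_yday,  # 1-365 (366 in leap years)
        "quarter": (d.month - 1) // 3 + 1,     # 1-4
    }


# One row per month of 2024, using the first day of each month
rows = [time_features(date(2024, m, 1)) for m in range(1, 13)]
```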
3.3. Model Development and Training
Multiple neural networks and DML architectures were implemented, including Feedforward networks, XGBoost configurations, BiLSTM with MultiHead Attention, and various ensemble combinations.
3.3.1. ML Models
The selection of Prophet, Ridge Regression, LightGBM, and Ensemble methods was based on a systematic analysis of tourism forecasting requirements and complementary algorithmic strengths. This combination of models is intended to provide a rigorous, theoretically grounded, and empirically validated approach to tourism forecasting. Each model was chosen for specific complementary strengths:
Prophet: Seasonal expertise and external regressor integration. Prophet was specifically designed for business time series with strong seasonal effects and external influences, which closely matches TD characteristics.
Ridge: Regularized stability and interpretable baseline. Ridge provides a regularized linear baseline that prevents overfitting while offering interpretable coefficients for stakeholder communication.
LightGBM: Nonlinear pattern recognition and feature interaction modeling. LightGBM excels at capturing complex nonlinear relationships and feature interactions that traditional time series models miss.
Ensemble: Combines the strengths of the individual models while mitigating their weaknesses, reducing variance through model combination.
3.3.2. DML Models
Six distinct forecasting approaches were implemented:
Feedforward Neural Network (XGBoost Top-10 Features): Feature-selected neural network using XGBoost importance rankings;
XGBoost (Tabular): Gradient boosting with tabular data structure;
Bi-LSTM + MultiHead Attention: Bidirectional LSTM with transformer-style attention mechanisms;
Prophet (Seasonal Components Only): Facebook’s Prophet algorithm utilizing solely seasonal patterns;
Bi-LSTM + Attention: Bidirectional LSTM with standard attention layers.
All models were trained using a standardized protocol to ensure fair comparison. Training data underwent preprocessing specific to each model’s requirements, including feature scaling for Ridge regression and Prophet’s internal preprocessing. LightGBM parameters were set to moderate complexity (n_estimators = 100, learning_rate = 0.1, max_depth = 6) to prevent overfitting while maintaining predictive power.
The ensemble weighting mechanism calculates inverse MAE weights during training: w_i = (1/MAE_i) / Σ_j (1/MAE_j), where MAE_i represents the mean absolute error of model i and the sum runs over all models j in the ensemble. This approach ensures that models with lower errors receive proportionally higher influence in the ensemble prediction.
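The inverse-MAE weighting can be sketched in a few lines of plain Python. The function names and the MAE and forecast values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the study.

```python
def inverse_mae_weights(maes):
    """Map {model: MAE} to {model: weight} with weight proportional to 1/MAE."""
    inverse = {name: 1.0 / mae for name, mae in maes.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def ensemble_forecast(forecasts, weights):
    """Weighted average of per-model forecasts (lists of equal length)."""
    horizon = len(next(iter(forecasts.values())))
    return [
        sum(weights[name] * forecasts[name][t] for name in forecasts)
        for t in range(horizon)
    ]


# Illustrative training MAEs (toy numbers)
maes = {"prophet": 100.0, "ridge": 200.0, "lightgbm": 400.0}
weights = inverse_mae_weights(maes)

# Illustrative two-step forecasts from each model (toy numbers)
forecasts = {
    "prophet": [700.0, 1400.0],
    "ridge": [1400.0, 700.0],
    "lightgbm": [700.0, 700.0],
}
combined = ensemble_forecast(forecasts, weights)
```

With these toy inputs the weights are 4/7, 2/7, and 1/7, so the model with the lowest MAE dominates the combined forecast without the others being discarded.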
3.4. Comparative Evaluation with Statistical Testing
3.4.1. Time Series Cross-Validation
A rigorous time series cross-validation framework was implemented with 5 folds to assess model generalization capability. The validation scheme employs expanding windows where each fold uses progressively more historical data for training while maintaining temporal order. Split points were determined by dividing the dataset length by (n_splits + 1), ensuring adequate training data while preserving sufficient test observations. Cross-validation ensures fair comparison across methods and helps detect overfitting, that is, models memorizing specific patterns in the training data.
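A minimal sketch of such an expanding-window scheme is given below. It assumes that each test block has length n_samples // (n_splits + 1) and that the blocks are placed at the end of the series, which is the convention followed by scikit-learn's TimeSeriesSplit; whether the study used that class is not stated, and the function name is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Return (train_indices, test_indices) pairs in temporal order."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    splits = []
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))  # all history so far
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


# 240 monthly observations, 5 folds
splits = expanding_window_splits(240, n_splits=5)
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

Under these assumptions each fold tests on 40 months, and the training window grows from 40 to 200 months, so no fold ever trains on observations that follow its test block.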
3.4.2. Statistical Significance Testing
Statistical testing determines whether performance differences between models are significant and validates the results of the tested models. In particular, it shows whether greater sophistication, namely the use of DML models, translates into a significant improvement for TD forecasting purposes. Therefore, the following statistical tests were performed:
The Ljung-Box test examines the null hypothesis that residuals exhibit no serial correlation, with rejection indicating the presence of autocorrelation that violates model assumptions;
The Augmented Dickey-Fuller (ADF) stationarity test evaluates the null hypothesis of non-stationarity (presence of a unit root) in model residuals, with rejection indicating stationary residual behaviour essential for valid time series modelling;
The normality test of residuals assesses whether model residuals follow a normal distribution, a fundamental assumption for valid statistical inference and confidence interval construction;
The heteroscedasticity test (constant variance) examines the null hypothesis of homoscedasticity (constant error variance), with rejection indicating heteroscedastic residuals that violate standard regression assumptions;
Diebold-Mariano tests were performed to ensure robustness. Developed by Diebold and Mariano (1995), this test compares the expected loss differential between two competing forecasts and is essentially an asymptotic z-test under the null hypothesis that the expected loss differential is zero [
46,
53].
The Diebold-Mariano test is thus designed to evaluate the comparative forecasting accuracy of competing predictive models by testing the null hypothesis of equal forecast accuracy.
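For illustration, a basic form of the Diebold-Mariano statistic can be computed with the standard library alone. The sketch below assumes a power loss (squared error by default), a long-run variance built from autocovariances up to lag h - 1, and a two-sided p-value from the standard normal distribution; small-sample corrections are omitted. The loss function and forecast horizon used in the study are not stated in this section, and the error series are toy values.

```python
import math


def diebold_mariano(errors_a, errors_b, h=1, power=2):
    """Return (DM statistic, two-sided p-value) for two forecast error series."""
    # Loss differential: positive values mean model A has the larger loss
    d = [abs(a) ** power - abs(b) ** power for a, b in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    # Long-run variance of the loss differential (lags 0 .. h-1)
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    stat = d_bar / math.sqrt(long_run_var / n)

    # Two-sided p-value under the asymptotic standard normal distribution
    normal_cdf = 0.5 * (1.0 + math.erf(abs(stat) / math.sqrt(2.0)))
    return stat, 2.0 * (1.0 - normal_cdf)


# Toy forecast errors: model A errs more than model B in every period
errors_a = [2.0, -2.0, 2.0, -2.0, 3.0, -3.0]
errors_b = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
dm_stat, dm_p = diebold_mariano(errors_a, errors_b)
```

A positive statistic with a small p-value indicates that the second model's losses are significantly lower; swapping the arguments flips the sign of the statistic and leaves the p-value unchanged.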
3.4.3. Performance Metrics
Multiple evaluation metrics were employed to provide a comprehensive performance assessment. Collectively, they capture different aspects of forecasting accuracy and support robust model comparison:
Mean Absolute Error (MAE) as the primary metric for its interpretability and robustness to outliers;
Mean Squared Error (MSE);
Root Mean Squared Error (RMSE) for sensitivity to larger errors;
Coefficient of Determination (R2);
Mean Absolute Deviation (MAD);
Symmetric Mean Absolute Percentage Error (SMAPE);
Mean Absolute Percentage Error (MAPE) for scale-independent comparison.
In addition to the traditional model accuracy metrics (MAE, MAPE, RMSE and R2), the Mean Absolute Deviation (MAD) and Symmetric Mean Absolute Percentage Error (SMAPE) were applied as complementary accuracy measures to provide a more comprehensive evaluation framework [54,55]. MAD offers an alternative absolute error metric that is less sensitive to outliers than RMSE, while SMAPE addresses the asymmetry inherent in traditional MAPE by treating over-forecasts and under-forecasts more symmetrically, making it particularly valuable for tourism data that may exhibit significant seasonal variations. Theil’s U coefficient was employed as a normalized forecast accuracy measure that enables comparison of forecast quality across different scales and time series, with values closer to zero indicating superior forecasting performance. This metric is especially important in tourism forecasting because it provides a standardized benchmark that accounts for the naive random walk forecast, allowing researchers to assess whether the sophisticated ensemble model genuinely adds predictive value beyond simple trend extrapolation [47,56].
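The sketch below shows how several of these metrics can be computed in plain Python. The SMAPE definition (absolute error divided by the mean of the absolute actual and forecast values) and the Theil's U variant (the U2 form, which compares relative forecast errors with those of a naive no-change forecast) are common formulations assumed here; the section does not specify which variants the study used. The function names and the data are hypothetical.

```python
import math


def mae(actual, forecast):
    return sum(abs(f - y) for y, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    return math.sqrt(sum((f - y) ** 2 for y, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    # Percentage error relative to the actual value (actuals must be non-zero)
    return 100.0 * sum(abs((f - y) / y) for y, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    # Error relative to the average magnitude of actual and forecast
    terms = [abs(f - y) / ((abs(y) + abs(f)) / 2.0) for y, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(actual)


def theil_u2(actual, forecast):
    # Values below 1 mean the forecast beats the naive no-change forecast
    pairs = range(len(actual) - 1)
    num = sum(((forecast[t + 1] - actual[t + 1]) / actual[t]) ** 2 for t in pairs)
    den = sum(((actual[t + 1] - actual[t]) / actual[t]) ** 2 for t in pairs)
    return math.sqrt(num) / math.sqrt(den)


# Toy series (illustrative values, not the study's data)
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
scores = {
    "MAE": mae(actual, forecast),
    "RMSE": rmse(actual, forecast),
    "MAPE": mape(actual, forecast),
    "SMAPE": smape(actual, forecast),
    "TheilU2": theil_u2(actual, forecast),
}
```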
All estimations were performed using Docker containerization within the RAPIDS Data Science environment, leveraging NVIDIA GPU acceleration and Jupyter notebook implementations for computational efficiency. This comprehensive approach enables tourism stakeholders to make informed decisions regarding capacity planning, investment strategies, and operational optimization during periods of economic volatility.
Key libraries included Prophet for time series forecasting, scikit-learn for machine learning implementation, LightGBM for gradient boosting, and SciPy for statistical testing. The TourismForecaster class was developed to encapsulate all methodological components, ensuring consistent implementation and facilitating reproducible research. What is more, Facebook Prophet, developed by Facebook’s Core Data Science team, is an open-source forecasting tool designed to accommodate time series data exhibiting multiple seasonality with linear or non-linear growth trends [
57]. Its capability to incorporate holiday effects and manage missing data makes it particularly suitable for TD forecasting [
58]. For instance, studies have applied Prophet to forecast international tourist arrivals in Indonesia during the COVID-19 pandemic, demonstrating its effectiveness in capturing the impact of unprecedented events on tourism trends [
59]. Similarly, research focusing on Albania utilized Prophet to model and forecast tourist arrivals, achieving an accuracy rate of 88%, thereby highlighting its practical applicability in diverse geographical contexts [
14].
Data processing, model training, and evaluation procedures were designed with modularity and transparency, enabling verification of results and potential extension to other tourism destinations. The implementation includes comprehensive error handling and validation checks to ensure data integrity throughout the analytical pipeline.
This methodological framework provides a robust foundation for addressing the research objectives while maintaining scientific rigor and practical applicability for tourism forecasting applications.
4. Results
This study presents a comprehensive evaluation of multiple forecasting methodologies applied to Bulgarian TD prediction, specifically targeting overnight stays as the primary dependent variable. The analysis incorporates traditional statistical methods, machine learning algorithms, and deep learning architectures to establish optimal forecasting performance for tourism planning applications. The dataset encompasses monthly overnight stays in Bulgaria from April 2005 to December 2024 (240 observations), with external regressors including COVID-19 cases (available from 2020) and Consumer Price Index (CPI) data with temporal lags [
48,
49,
50,
51,
52].
Figure 1 displays the three data sets used for forecasting Bulgarian TD over 2005–2024, merged into a single data set so that all models could be applied to the same input. The visualization reveals three distinct periods: stable seasonal growth (2005–2019), COVID-19 disruption (2020–2022), and recovery (2023–2024). Traditional seasonal patterns show summer peaks of approximately 4 million overnight stays and winter troughs of 200,000–400,000 stays. The overnight-stay series was supplemented with COVID-19 incidence data and temporal covariates; missing COVID-19 values for the pre-2020 period were handled with zero imputation, reflecting the absence of the pandemic. Feature engineering included:
COVID-19 confirmed cases per million population (monthly aggregation);
Temporal decomposition (year as continuous variable, month as categorical one-hot encoding);
Interpolated CPI data with temporal lags;
Regularization analysis using Ridge and Lasso regression for feature selection.
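The feature engineering steps listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's pipeline: the function `build_features`, the record layout, and the sample values are hypothetical, while the column names `cpi_lag1` and `cpi_lag2` mirror the lagged CPI variables discussed with the correlation matrix.

```python
def build_features(records, cpi_lags=(1, 2)):
    """Turn monthly records into model-ready feature rows.

    records: chronologically ordered dicts with keys 'year', 'month', 'cpi'
             and, optionally, 'covid_per_million' (hypothetical layout).
    """
    rows = []
    for i, rec in enumerate(records):
        if i < max(cpi_lags):
            continue  # not enough history to build the longest lag
        row = {
            "year": float(rec["year"]),  # year as a continuous variable
            "cpi": rec["cpi"],
            # Zero imputation: no COVID-19 value means no pandemic yet
            "covid_per_million": rec.get("covid_per_million") or 0.0,
        }
        for lag in cpi_lags:
            row[f"cpi_lag{lag}"] = records[i - lag]["cpi"]
        for m in range(1, 13):  # month as one-hot encoding
            row[f"month_{m}"] = 1.0 if rec["month"] == m else 0.0
        rows.append(row)
    return rows

# Hypothetical sample values, not taken from the study's data
records = [
    {"year": 2019, "month": 11, "cpi": 100.0},
    {"year": 2019, "month": 12, "cpi": 100.4},
    {"year": 2020, "month": 1, "cpi": 100.9},
    {"year": 2020, "month": 2, "cpi": 101.1},
    {"year": 2020, "month": 3, "cpi": 100.8, "covid_per_million": 57.0},
]
features = build_features(records)
print(len(features), features[0]["cpi_lag1"], features[-1]["covid_per_million"])
```

The first two records are consumed as lag history, so three feature rows remain for the five input months.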
4.1. Deep Machine Learning Models
Contrary to prevailing research practice, which typically starts with classical forecasting techniques, this study first implemented deep learning architectures for tourism forecasting.
However, quantitative performance analysis revealed consistently insufficient results (
Table 1) from deep learning models, with Bi-LSTM + MultiHead Attention achieving negative R² scores (−0.1196) and Bi-LSTM + Attention producing anomalous MAPE values (204.66%), indicating overfitting and training instability. These findings contradict expectations of deep learning superiority in complex time series forecasting. Consequently, the research methodology pivoted toward traditional machine learning and ensemble approaches, which demonstrated superior performance characteristics. The Feedforward + Prophet Ensemble ultimately emerged as the optimal solution with MAE of 762,868 and MAPE of 58.02%, significantly outperforming deep learning alternatives. This methodological shift underscores the importance of empirical validation over theoretical assumptions, revealing that sophisticated neural architectures may not inherently provide better forecasting accuracy for TD prediction, particularly when dealing with seasonal patterns and economic indicator integration.
The comparative analysis reveals that ensemble and gradient boosting methodologies consistently outperformed deep learning architectures across multiple evaluation criteria, with the Feedforward + Prophet Ensemble achieving the lowest mean absolute error (762,868) while Feedforward (XGBoost) demonstrated superior percentage accuracy at 53.78% MAPE. XGBoost (Tabular) provided the highest explanatory power with an R² score of 0.2014, suggesting better capture of underlying data variance compared to neural network alternatives. Deep learning approaches exhibited significant performance deficiencies, particularly BiLSTM + MultiHead Attention, which recorded a negative R² score of −0.1196, indicating predictions worse than a simple mean baseline. The BiLSTM + Attention architecture displayed contradictory and unstable metrics, achieving a competitive RMSE of 1,046,324 while simultaneously producing an anomalously high MAPE of 204.66%, suggesting fundamental training or architectural issues. These results challenge conventional assumptions about deep learning superiority in time series forecasting, demonstrating that traditional machine learning methods may be more suitable for TD prediction tasks involving seasonal patterns and economic indicators. Therefore, the ML models were compiled and tested for statistical significance using the Diebold-Mariano (DM) test.
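Two of the patterns reported here, a negative R² and a high MAPE alongside an unremarkable RMSE, can be reproduced on a toy seasonal series. The sketch below uses hypothetical values (in thousands of overnight stays) and is only meant to show the mechanics: R² falls below zero whenever the squared error exceeds that of a constant mean forecast, and MAPE is dominated by misses in low-season months because each error is divided by a small actual value.

```python
import math

def r2_score(actual, forecast):
    """Coefficient of determination relative to a constant mean forecast."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, forecast):
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical seasonal profile: low winter months, one summer peak
actual = [300, 400, 2000, 4000, 2000, 400]

# Baseline: always predict the mean, which gives R^2 = 0 by construction
mean_forecast = [sum(actual) / len(actual)] * len(actual)

# A forecast that mistimes the seasonal peak does worse than the mean
shifted_forecast = [2000, 4000, 2000, 400, 300, 400]

# A forecast that is exact in high season but over-predicts the low season
low_season_miss = [900, 1000, 2000, 4000, 2000, 1000]

for name, fc in [("mean", mean_forecast),
                 ("shifted", shifted_forecast),
                 ("low-season miss", low_season_miss)]:
    print(name, round(r2_score(actual, fc), 3), round(rmse(actual, fc), 1),
          round(mape(actual, fc), 1))
```

In the third case the fit is good by R² and RMSE, yet MAPE exceeds 80%, which is one plausible reading of how a model can pair a competitive RMSE with a very large percentage error on strongly seasonal data.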
4.2. Machine Learning Models
The multi-model methodology addresses the complex, multi-faceted nature of TD while providing superior accuracy, interpretability, and crisis resilience compared to any single-model approach. This combination of ML algorithms is intended to capture different aspects of nonlinear time series patterns through gradient boosting and meta-learning with ensemble combination methods.
The comprehensive accuracy evaluation in
Table 2 reveals a consistent hierarchical performance ranking among the machine learning models, with the ensemble approach achieving superior forecasting accuracy across all measures (MAE = 156,847, MAPE = 14.23%, Theil’s U = 0.678). The ensemble model demonstrates substantial improvements over individual models, outperforming the worst-performing model, Ridge regression, by 23.0% in MAE terms and achieving a Theil’s U coefficient well below the critical threshold of 1.0, indicating forecast quality superior to naive random walk predictions. Among individual models, Prophet and LightGBM exhibit comparable performance levels (MAE difference of only 5,658), while Ridge regression consistently underperforms across all metrics with the highest error rates (MAPE = 21.47%, SMAPE = 19.34%). These results confirm the effectiveness of the ensemble weighting strategy, which leverages the complementary strengths of Prophet’s trend decomposition, LightGBM’s nonlinear pattern recognition, and Ridge’s regularization for robust Bulgarian TD forecasting. To aid interpretation of the results, a feature correlation matrix was computed.
The ensemble model achieved an MAE of 156,847, an RMSE of 298,245, and a MAPE of 14.23%, outperforming individual models by 10.2%; its MAE is 7.16% lower than that of the best single model. The deep learning experiments reported in Section 4.1 showed markedly different characteristics: the Feedforward + Prophet Ensemble performed best with MAE of 762,868 and MAPE of 58.02%, while traditional Prophet (Seasonal Only) showed MAE of 910,000 and MAPE of 72.80%. Complex architectures like BiLSTM + MultiHead Attention achieved MAE of 875,129 but exhibited negative R² scores, suggesting overfitting. Among the ML models, the ensemble’s advantage extends across all evaluation metrics, with the ensemble outperforming the worst-performing Ridge regression by 23.0% in MAE terms (203,756 vs. 156,847) and consistently delivering the most reliable forecasting accuracy for Bulgarian tourism demand prediction.
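The 23.0% figure follows directly from the two MAE values quoted in the text. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def relative_improvement(baseline_mae, model_mae):
    """Percentage reduction in MAE relative to the baseline model."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae

ensemble_mae = 156_847  # reported in the text
ridge_mae = 203_756     # reported in the text

absolute_gain = ridge_mae - ensemble_mae
print(absolute_gain)                                          # 46909
print(round(relative_improvement(ridge_mae, ensemble_mae), 1))  # 23.0
```

The absolute difference of 46,909 overnight stays is the same quantity reported as the MAE reduction against Ridge in the Diebold-Mariano discussion.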
The correlation matrix (
Figure 2) reveals strong positive correlations among the Consumer Price Index (CPI) variables, with CPI, CPI_lag1, and CPI_lag2 showing correlations exceeding 0.98, indicating these lagged economic indicators move nearly in perfect synchronization. COVID cases per million demonstrate moderate positive correlations with both the target variable
y (0.348), suggesting that pandemic intensity was associated with both tourism patterns and temporal progression. The CPI-related variables exhibit weak negative correlations with the target variable
y (ranging from −0.015 to −0.038), implying that higher consumer prices may have a slight inverse relationship with tourism overnights. Overall, the matrix suggests that COVID impact and economic inflation measures are the primary drivers with measurable correlations to the tourism outcome variable, while temporal month encoding provides limited predictive value in linear terms.
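The near-unit correlations among the CPI variables are what one would expect from a slowly drifting index and its own lags. The sketch below computes Pearson's correlation for a hypothetical index series (the values are invented for illustration) to show why `CPI`, `CPI_lag1`, and `CPI_lag2` carry largely overlapping information, which is one reason regularized regression is a natural choice when such lags enter a linear model together.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical index: steady upward drift plus a small periodic bump
cpi = [100.0 + 0.3 * t + (0.5 if t % 4 == 0 else 0.0) for t in range(48)]

cpi_now = cpi[1:]    # value in month t
cpi_lag1 = cpi[:-1]  # value in month t - 1

r_lag1 = pearson(cpi_now, cpi_lag1)
print(round(r_lag1, 3))
```

Because the drift dominates the month-to-month noise, the series and its lag are almost perfectly correlated even though neither need be strongly related to the target.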
The time series analysis shown in
Figure 3a,b summarizes Bulgaria’s tourism forecasting data.
Figure 3a presents a comprehensive comparison of actual monthly tourism overnights in Bulgaria against predictions from the four ML forecasting models (Prophet, Ridge, LightGBM, and Ensemble) spanning 2005 to 2024. The data exhibit strong seasonal patterns with consistent annual peaks reaching up to 4 million overnights during summer months and troughs near zero during winter periods. Bulgarian tourism exhibits extreme seasonality, with a coefficient of variation of 64.1%: summer peaks (June–August) generate 234% of annual average demand, while winter months fall to 51% of average. This 244× variation between extreme months necessitates sophisticated modelling approaches capable of handling non-linear seasonal patterns while maintaining robustness during crisis periods. A dramatic disruption occurred around 2020, corresponding to the COVID-19 pandemic, where actual tourism numbers plummeted significantly below historical trends before recovering in subsequent years. The residuals plot in
Figure 3b reveals that the pre-COVID period (2006–2019) is characterized by relatively stable residual behaviour across all models, with residuals generally contained within ±0.5 million overnight stays, indicating adequate model performance during normal economic conditions. A dramatic deterioration in residual behaviour occurs during the COVID-19 period (2020–2024), where all models exhibit extreme residuals reaching up to 3 million overnight stays, highlighting the unprecedented challenge of forecasting during structural breaks and crisis periods. Ridge regression (red line) consistently displays the highest residual volatility throughout the entire period, supporting the statistical diagnostic findings of its poor model adequacy, while the Ensemble model (purple line) demonstrates relatively more stable residual behaviour, particularly during the middle period (2015–2019). The synchronized increase in residual magnitudes across all models during 2020–2024 suggests that Bulgaria’s tourism recovery exceeded all models’ expectations, with the extreme positive residuals in recent years indicating systematic under-prediction of the post-pandemic tourism surge.
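The seasonality measures quoted above (coefficient of variation and monthly demand as a percentage of the overall average) can be computed as in the following sketch. The function names and the toy data are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was applied.

```python
import statistics
from collections import defaultdict

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (population form assumed)."""
    return 100.0 * statistics.pstdev(values) / statistics.mean(values)

def seasonal_indices(values, months):
    """Average demand in each calendar month as a percentage of the overall mean."""
    overall_mean = statistics.mean(values)
    grouped = defaultdict(list)
    for value, month in zip(values, months):
        grouped[month].append(value)
    return {month: 100.0 * statistics.mean(vals) / overall_mean
            for month, vals in grouped.items()}

# Hypothetical toy series: two Januaries and two Julys
stays = [100.0, 300.0, 100.0, 300.0]
months = [1, 7, 1, 7]

print(coefficient_of_variation(stays))  # 50.0
print(seasonal_indices(stays, months))  # {1: 50.0, 7: 150.0}
```

An index of 150 for July means demand in that month runs at 150% of the overall monthly average, which is how the summer and winter percentages in the text should be read.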
Comprehensive residual diagnostics (
Table 3) provide statistical validation for the observed forecasting performance differences. The ensemble model passes all diagnostic tests at the 5% significance level (Ljung-Box
p = 0.234, normality
p = 0.156, homoscedasticity
p = 0.089), while Ridge regression exhibits significant violations across multiple assumptions (autocorrelation
p = 0.023, normality
p = 0.034, heteroscedasticity
p = 0.012), explaining its inferior forecasting accuracy.
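For readers unfamiliar with the Ljung-Box diagnostic, the statistic is Q = n(n + 2) Σ ρ̂ₖ² / (n − k), summed over lags k = 1, …, h, where ρ̂ₖ is the lag-k autocorrelation of the residuals. The following dependency-free sketch computes Q only; the function names and the residual series are hypothetical, and in practice a library routine would also return the p-value from the chi-square distribution with h degrees of freedom.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((r - mean) ** 2 for r in residuals)
    num = sum((residuals[t] - mean) * (residuals[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )

# Hypothetical residuals that alternate in sign: strong negative autocorrelation
alternating = [1.0, -1.0] * 20

q_stat = ljung_box_q(alternating, max_lag=1)
CHI2_CRITICAL_1DF_5PCT = 3.841  # 95th percentile of chi-square with 1 d.o.f.

print(round(q_stat, 2), q_stat > CHI2_CRITICAL_1DF_5PCT)
```

A Q value above the critical value rejects the hypothesis of uncorrelated residuals, which corresponds to the small Ljung-Box p-value reported for Ridge; a large p-value, as for the ensemble, means no evidence of remaining autocorrelation.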
The Diebold-Mariano test results in
Table 4 provide compelling statistical evidence for the superiority of the ensemble forecasting approach, with the ensemble model demonstrating significant outperformance against all individual models at conventional significance levels (
p < 0.05), including highly significant improvement over Ridge regression (DM = −3.456,
p = 0.0005) and significant enhancement over Prophet (DM = −2.347,
p = 0.0189). Among the individual models, a clear hierarchical performance structure emerges where Prophet and LightGBM both significantly outperform Ridge regression (
p = 0.0286 and
p = 0.0103, respectively), while the difference between Prophet and LightGBM lacks statistical significance (
p = 0.5009), indicating comparable forecasting capabilities between these two advanced machine learning approaches. These findings validate the theoretical expectation that ensemble methods, by combining complementary forecasting strengths and reducing individual model biases, can achieve statistically significant improvements in TD prediction accuracy, with MAE reductions ranging from 12,087 (vs. LightGBM) to 46,909 (vs. Ridge) tourist overnight stays. Specifically, we can reject the hypothesis that the ensemble and Ridge regression have equal forecasting ability with high confidence (
p = 0.0005): if the two models had equal predictive accuracy, a test statistic at least this extreme would be expected in only about 0.05% of samples, which makes it unlikely that the observed MAE improvement of 46,909 overnight stays is due to chance alone.
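A minimal sketch of the Diebold-Mariano comparison is given below. It rests on assumptions that the text does not confirm: an absolute-error loss, a one-step-ahead horizon (so only the lag-0 variance of the loss differential is used), and a two-sided p-value from the standard normal distribution. The function names and the toy forecasts are hypothetical. Under the normal approximation, the reported statistics of −3.456 and −2.347 map to p-values that round to the reported 0.0005 and 0.0189.

```python
import math

def two_sided_p_value(stat):
    """Two-sided p-value under N(0, 1): 2 * (1 - Phi(|stat|)) = erfc(|stat| / sqrt(2))."""
    return math.erfc(abs(stat) / math.sqrt(2.0))

def diebold_mariano(actual, forecast_a, forecast_b):
    """DM statistic with absolute-error loss and horizon 1 (assumed setup).

    A negative statistic means forecast_a has the lower average loss.
    """
    diffs = [abs(y - a) - abs(y - b)
             for y, a, b in zip(actual, forecast_a, forecast_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / n  # lag-0 autocovariance
    stat = mean_diff / math.sqrt(var_diff / n)
    return stat, two_sided_p_value(stat)

# Hypothetical toy example: forecast_a is closer to the actuals than forecast_b
actual = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
forecast_a = [11.0, 10.0, 15.0, 14.0, 19.0, 18.0]
forecast_b = [13.0, 15.0, 9.0, 20.0, 12.0, 25.0]

dm_stat, p_value = diebold_mariano(actual, forecast_a, forecast_b)
print(round(dm_stat, 3), round(p_value, 6))

# Consistency check against the statistics reported in the text
print(round(two_sided_p_value(-3.456), 4), round(two_sided_p_value(-2.347), 4))
```

For multi-step horizons the variance term would need autocovariances at further lags, and small samples are often handled with the Harvey-Leybourne-Newbold correction; neither refinement is shown here.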
Overall, the Ensemble approach’s ability to combine Prophet’s seasonal decomposition capabilities, LightGBM’s non-linear pattern recognition, and Ridge’s regularization properties yielded the best forecasting performance for Bulgarian tourism demand. These performance advantages contrast directly with the deep learning architectures, where Bi-LSTM + Attention exhibited severe overfitting with an anomalous MAPE of 204.66%, and with standalone Ridge regression, whose linear assumptions proved inadequate and produced the highest error rate among the ML models (MAPE = 21.47%). Together, these results indicate that sophisticated neural networks and traditional linear models fail to capture Bulgaria’s extreme seasonal tourism patterns (CV = 64.1%) as effectively as the balanced ensemble methodology.
5. Discussion
As evident from the results above, the DML models underperformed and showed widely varying characteristics: the Feedforward + Prophet Ensemble achieved the best results among them (MAE: 762,868, MAPE: 58.02%), while traditional Prophet configurations showed even higher error rates (MAE: 910,000, MAPE: 72.80%). In contrast, the superior performance of the ML ensemble approach in this study aligns with established tourism forecasting literature, which consistently finds that no single model outperforms all others in all situations and emphasizes improving forecasting accuracy through forecast combination [
17,
60]. Our findings corroborate recent research highlighting the efficacy of decomposition and ensemble algorithms in enhancing forecasting accuracy [
61,
62], with the ensemble model achieving a 10.2% improvement over the best individual model, which falls within the typical range of ensemble improvements reported in tourism forecasting studies. The underperformance of deep learning models, particularly LSTM architectures, contradicts expectations based on recent systematic reviews that found LSTM was the most popular deep learning algorithm used to build prediction models in TD forecasting [
4,
27,
28,
29,
30,
31,
32,
47,
63], yet our results align with studies showing that deep learning models can suffer from overfitting and training instability in tourism applications [
47,
56]. While some research demonstrates successful LSTM implementation for long-term TD forecasting when incorporating exogenous variables [
27], our negative R² scores (−0.1196) for BiLSTM + MultiHead Attention suggest that architectural complexity does not guarantee superior performance for Bulgarian tourism data. The Prophet model’s competitive performance (MAPE = 16.85%) is consistent with recent business events tourism research showing that Prophet outperforms complex neural network models in forecasting business event TD [
64], particularly in volatile tourism sectors. Our LightGBM results (MAPE = 15.94%) align with comparative studies showing that gradient boosting methods can achieve performance comparable to statistical time series models, with one study reporting that, while there was not much performance difference among the three models it compared, ARIMA performed slightly better than the others [
27]. The effectiveness of our decomposition-ensemble framework supports recent theoretical advances proposing that linear components be modelled with a classical autoregressive integrated moving average (ARIMA) model, whereas non-linear components require the application of a long short-term memory network (LSTM) [
64], though our implementation favoured simpler machine learning approaches. The integration of COVID-19 variables and CPI data as external regressors mirrors hybrid forecasting approaches, such as the Prophet-LightGBM combination for rainfall prediction that achieved an RMSE of 13.8462, an MAE of 8.6037, and an R² value of 0.2569 [
63,
65], demonstrating the value of combining domain-specific decomposition with gradient boosting techniques. The Diebold-Mariano test results provide statistical validation for ensemble superiority, with highly significant improvements over Ridge regression (
p = 0.0005) supporting the methodological rigor advocated in recent tourism forecasting literature for robust model comparison. Our comprehensive evaluation using multiple accuracy metrics (MAE, RMSE, MAPE, SMAPE, Theil’s U) follows best practices established in tourism forecasting research, where decomposition-ensemble approaches are developed, in order to simplify the difficulty of a forecasting task by dividing it into a number of relatively easier subtasks [
37,
66], ultimately validating the practical superiority of ensemble methods for Bulgarian TD prediction. The extreme seasonal volatility (CV = 64.1%) provides empirical justification for the superior performance of ensemble methods over deep learning architectures, as complex neural networks likely overfit to the dramatic seasonal extremes rather than learning generalizable patterns.
The findings advance both theoretical knowledge and practical application by providing empirically validated evidence that ensemble methods (combining Prophet’s seasonal decomposition, LightGBM’s non-linear pattern recognition, and Ridge’s regularization) offer superior performance for highly seasonal tourism markets, while simultaneously providing actionable implementation guidance through statistical validation and comprehensive evaluation frameworks that tourism practitioners can directly apply to improve forecasting accuracy and decision-making reliability. The residual analysis reveals that traditional linear methods fail to properly model Bulgaria’s extreme tourism seasonality, as evidenced by Ridge regression’s multiple assumption violations. In contrast, ensemble and machine learning approaches successfully capture the complex seasonal dynamics while maintaining statistical adequacy, providing both superior accuracy and reliable inference capabilities for tourism stakeholders.
Critical data needs include the integration of high-frequency alternative data sources such as social media sentiment indicators, mobile phone mobility patterns, flight search volumes, and real-time economic indicators to enhance forecast accuracy and reduce uncertainty during volatile periods. Additionally, future studies should explore multi-destination ensemble frameworks that can capture cross-border tourism spillover effects and investigate optimal forecasting horizons for different stakeholder needs (tactical vs. strategic planning), while developing specialized crisis period forecasting protocols that can rapidly adapt to unprecedented events like COVID-19. The integration of satellite imagery data for capacity monitoring, weather pattern data for seasonal adjustment, and blockchain-verified tourism transaction data could further enhance forecasting accuracy, while longitudinal studies examining ensemble performance stability across multiple economic cycles would provide valuable insights for long-term model reliability and practical implementation in dynamic tourism environments.
6. Conclusions
This study addresses the critical need for accurate TD forecasting in Bulgaria using economic indicators, successfully developing robust predictive models to navigate post-pandemic market volatility. The comprehensive comparative analysis of ML and DML methodologies demonstrates the superiority of ensemble approaches for Bulgarian TD forecasting. The ensemble model combining Prophet, LightGBM, and Ridge regression achieved optimal results with MAE of 156,847 and MAPE of 14.23%, outperforming individual models by 10.2%. Statistical significance testing through Diebold-Mariano analysis confirmed these performance differences (p < 0.05), while comprehensive residual diagnostics validated model adequacy.
Contrary to prevailing assumptions about deep learning superiority, traditional machine learning ensemble approaches demonstrated superior performance in capturing Bulgaria’s complex tourism patterns. Deep learning alternatives, particularly Bi-LSTM architectures, exhibited significant deficiencies with negative R² scores, indicating fundamental limitations in handling seasonal tourism patterns, probable data dependence, and overfitting issues. The extreme seasonal variations characteristic of Bulgarian tourism (coefficient of variation = 64.1%) with 244-fold variation between peak and trough months provided empirical justification for ensemble methodologies over sophisticated deep learning architectures.
This research advances tourism demand forecasting theory through three primary contributions addressing post-pandemic market challenges. First, it provides empirical evidence challenging the assumption that increasing model complexity necessarily improves forecasting accuracy in highly seasonal tourism contexts, demonstrating that ensemble methods outperform sophisticated deep learning architectures through balanced integration of complementary modelling strengths. Second, the study establishes a methodological framework for integrating economic indicators (macroeconomic variables) and COVID-19 case data in TD forecasting, demonstrating that external regressor integration with appropriate lag structures significantly enhances structural break detection during periods of economic volatility. Third, the research contributes to understanding tourism’s economic sensitivity by validating ensemble approaches that effectively incorporate macroeconomic variables while maintaining statistical adequacy, as confirmed through comprehensive residual diagnostics and statistical significance testing.
The findings provide tourism stakeholders and policymakers with empirically validated forecasting tools for enhanced decision-making during post-pandemic market volatility. The ensemble approach offers improved accuracy for strategic planning applications across multiple operational domains.
For Tourism Practitioners and Destination Management Organizations:
Implement ensemble forecasting frameworks combining Prophet for seasonal decomposition, LightGBM for non-linear pattern recognition, and Ridge regression for stability to achieve optimal TD forecasting accuracy;
Integrate economic indicators (Consumer Price Index with 2–3 month lag structures) and crisis-related variables (COVID-19 case data) to enable proactive responses to market disruptions;
Develop investment planning capabilities using ensemble model outputs for capacity allocation decisions, marketing budget optimization, and operational resource management during economic volatility;
Establish real-time monitoring systems for structural break detection, particularly during crisis periods when historical tourism patterns may lose predictive validity;
Create scenario-based planning protocols accounting for Bulgaria’s extreme seasonal variations (3.0× peak-to-trough multiplier) in operational capacity management.
For Policymakers and Government Agencies:
Deploy TD forecasting systems for tourism sector monitoring and economic policy development, supporting resilient tourism planning strategies;
Establish crisis management protocols using forecasting model outputs to identify potential tourism vulnerabilities and implement timely interventions;
Integrate forecasting capabilities with macroeconomic policy tools to anticipate tourism sector responses to economic indicators and external shocks;
Support data infrastructure development enabling real-time integration of economic variables and tourism metrics for enhanced forecasting accuracy.
This study acknowledges several limitations that constrain the generalizability of findings. The analysis focuses exclusively on Bulgarian tourism demand using monthly overnight stay data (2005–2024), and results may not transfer directly to destinations with different seasonal patterns, economic structures, or tourism market characteristics. While the integration of COVID-19 case data and GDP-CPI variables provides valuable insights into economic sensitivity, the COVID-19 period represents a unique historical disruption that may not reflect normal tourism patterns or future crisis scenarios.
The ensemble methodology’s superiority, while statistically validated through comprehensive testing, may be context-dependent and could vary across different tourism markets with distinct seasonal characteristics or economic environments. The study’s focus on overnight stays as the primary TD indicator may not capture all dimensions of tourism economic impact relevant to different stakeholder groups. Additionally, the specific economic indicators utilized (CPI variables) may not be available with comparable quality or timeliness in all tourism destinations, potentially limiting the transferability of the economic integration framework.
The 20-year dataset, while comprehensive for Bulgarian tourism analysis, represents a specific historical period including unprecedented global disruptions, and the findings’ applicability to future tourism environments characterized by different economic conditions or crisis scenarios remains to be validated.
The non-stationary nature of tourism demand time series, characterized by structural breaks and regime changes during crisis periods, necessitates further development of adaptive forecasting methodologies capable of detecting and responding to fundamental shifts in post-pandemic tourism patterns. Future research should validate these ensemble methodology findings across multiple destinations with varying seasonal characteristics and economic contexts to establish broader guidelines for TD forecasting applications in uncertain environments.
Research priorities should include extending the economic indicator integration framework to incorporate additional macroeconomic variables beyond Consumer Price Index data, investigating optimal lag structures for different types of external regressors in post-pandemic contexts, and developing specialized ensemble approaches for highly seasonal tourism markets. The integration of real-time data streams, including high-frequency economic indicators and crisis-related variables, represents a promising avenue for enhancing TD forecasting accuracy during periods of economic volatility.
Furthermore, the development of automated ensemble weighting procedures using inverse MAE optimization could enhance the practical applicability of these methodologies for tourism organizations with limited technical resources. Investigation of forecasting model performance across different economic environments and the development of specialized crisis period forecasting protocols represent additional opportunities for advancing tourism demand forecasting theory and practice in post-pandemic tourism management.
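The inverse-MAE weighting idea mentioned above can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used in the study: the helper names, the validation MAE values, and the forecasts are all hypothetical. Each model's weight is proportional to the reciprocal of its validation MAE, so more accurate models contribute more to the combined forecast.

```python
def inverse_mae_weights(mae_by_model):
    """Normalize reciprocal MAEs so that the weights sum to 1."""
    inverse = {name: 1.0 / value for name, value in mae_by_model.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def combine_forecasts(forecasts_by_model, weights):
    """Weighted average of the individual model forecasts, period by period."""
    horizon = len(next(iter(forecasts_by_model.values())))
    return [
        sum(weights[name] * forecasts_by_model[name][t] for name in weights)
        for t in range(horizon)
    ]

# Hypothetical validation errors and forecasts, not taken from the study
validation_mae = {"prophet": 200.0, "lightgbm": 100.0, "ridge": 400.0}
forecasts = {
    "prophet": [1000.0, 1200.0],
    "lightgbm": [1100.0, 1300.0],
    "ridge": [900.0, 1000.0],
}

weights = inverse_mae_weights(validation_mae)
ensemble = combine_forecasts(forecasts, weights)
print(weights)
print(ensemble)
```

Computing the weights on a held-out validation window rather than on the training data keeps an overfitted model from receiving an inflated share.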