1. Introduction
Forecasting respiratory virus-associated weekly hospitalization rates per 100,000 population is a core task in digital epidemiology and public health operations, supporting inpatient bed planning, ICU surge readiness, staffing, and supply-chain decisions [1,2,3]. Compared with case counts, hospitalization rates are typically more actionable for capacity management and less sensitive to changes in testing behavior [4,5]. In the United States, weekly hospitalization surveillance is routinely reported through CDC platforms such as RSV-NET, FluSurv-NET, and COVID-NET [6,7], while European summaries are disseminated through ECDC/WHO, enabling short-term early-warning forecasting for healthcare systems [8,9].
Despite recent progress in short-term forecasting for respiratory admissions, several gaps remain for operational hospitalization forecasting under post-pandemic conditions. Recent studies have shown that statistical and hierarchical models can provide useful subnational forecasts for influenza hospital admissions, while operational ensemble systems have also been deployed in real time for COVID-19, influenza, and RSV admissions and bed-occupancy forecasting [10,11]. At a finer operational scale, exogenous signals such as mobility and testing indicators have been shown to improve hospital-level COVID-19 admission forecasts, especially at longer lead times [12]. In parallel, multi-source forecasting studies have highlighted that jointly leveraging heterogeneous surveillance streams can be particularly valuable when the target hospitalization signal has only a short historical record [13].
However, three challenges remain insufficiently addressed for weekly respiratory hospitalization forecasting. First, the target often reflects a composite burden associated with co-circulating pathogens such as influenza, RSV, and SARS-CoV-2, rather than a single pathogen-specific process, which complicates model design and interpretation. Second, post-pandemic respiratory dynamics are characterized by pronounced regime shifts, making a single end-to-end model prone to entangling relatively stable seasonal structure with outbreak-driven excess dynamics. Third, although internet-based and behavioral data can provide timely supplementary information, recent reviews also emphasize that such digital signals are noisy, platform-dependent, and not always robust when incorporated naively [14].
Motivated by these observations, we argue that a practical forecasting framework for respiratory hospitalizations should satisfy three properties simultaneously: it should preserve a stable seasonal baseline, isolate outbreak-related excess fluctuations, and integrate only a small set of complementary exogenous signals in a disciplined way. To this end, we propose a two-stage framework that first learns the pre-pandemic baseline pattern and then models the excess component using multi-source search trends. This design is intended to improve robustness, interpretability, and sample efficiency in limited-data, regime-shift settings.
Concretely, the framework decouples baseline seasonality from COVID-19-induced excess hospitalizations and fuses a small set of search trends in a controlled manner. Stage 1 fits a lightweight unidirectional GRU on the pre-pandemic segment to estimate the baseline [15]. Stage 2 models the residual (excess) component: each search trend (flu, COVID-19, fever) is encoded by its own Bi-GRU (bi-directional GRU), in contrast to the unidirectional GRU of Stage 1, and the encodings are then fused by a standard multi-head self-attention layer that captures inter-trend dependencies [16,17]. We emphasize that composite outcome definitions can differ across surveillance systems (e.g., European composites can be constructed by summing pathogen-specific components, whereas U.S. RESP may represent a broader respiratory hospitalization burden); we therefore document outcome construction and alignment rules in the Methods section.
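To make the Stage-2 fusion step concrete, the sketch below applies scaled dot-product self-attention across per-trend encodings in NumPy. The function name `trend_self_attention`, the dimensions, and the single-head simplification (the actual model uses multi-head attention over Bi-GRU states) are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trend_self_attention(E, Wq, Wk, Wv):
    """Single-head self-attention across the trend axis.

    E: (T, H) matrix of per-trend encodings (e.g., final Bi-GRU states).
    Returns the fused (T, H) encodings and the (T, T) attention weights.
    """
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # inter-trend affinities
    return attn @ V, attn

rng = np.random.default_rng(0)
T, H = 3, 16                                   # three trends (flu, COVID-19, fever)
E = rng.normal(size=(T, H))
Wq, Wk, Wv = (rng.normal(size=(H, H)) for _ in range(3))
fused, attn = trend_self_attention(E, Wq, Wk, Wv)
```

Each row of `attn` is a convex weighting over the three trend streams, which is what makes the fusion inspectable.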
We conduct experiments on two distinct datasets. For the United States (AME), we use weekly hospitalization surveillance from CDC platforms including RSV-NET, FluSurv-NET, and COVID-NET [18,19,20], which are part of the CDC RESP-NET respiratory virus hospitalization surveillance infrastructure [21]. For Europe (EU), we use weekly respiratory virus surveillance released by ECDC/WHO Europe, specifically the non-sentinel severity dataset (nonSentinelSeverity.csv), in which the indicator hospitaladmissions is reported weekly by country. We additionally extract online search signals from Google Trends [22] as behavioral indicators that may lead hospitalization dynamics.
On the AME dataset, our two-stage multi-trend framework achieves substantial improvements over classical and modern baselines, providing a strong accuracy–efficiency trade-off, consistent with findings that carefully designed lightweight architectures can be competitive with heavier Transformer-style models for applied forecasting tasks [17,23]. Importantly, systematic trend ablations identify a robust “sweet spot”: three trends yield the best generalization, whereas adding an extra trend leads to overfitting, echoing prior observations that adding noisy behavioral signals can harm predictive performance in digital epidemiology [24,25]. These findings support a practical deployment message: mechanism decoupling and disciplined trend fusion can outperform heavier end-to-end architectures in limited-sample, regime-shift settings.
Although our experiments are set in the COVID-19 context, the proposed framework is designed for general outbreak-driven distribution shifts rather than a single disease: Stage 1 learns a stable pre-event baseline, and Stage 2 models the event-driven excess using timely exogenous signals. Consequently, the same pipeline can be retrained and deployed for future large-scale respiratory outbreaks or pandemics when historical baseline data and early-phase observations become available.
Our main contributions are as follows:
Deployment-oriented baseline-excess two-stage learning framework: Addressing distribution shifts induced by pandemics, we reformulate weekly respiratory hospitalization rate forecasting per 100,000 population during COVID-19 as a baseline–excess decomposition problem. In the first stage, a lightweight model learns stable pre-pandemic season patterns; the second stage focuses on the non-stationary excess component by integrating multi-source search trends. This design decouples complex dynamics and can be directly adapted to future respiratory outbreaks.
Lightweight multi-trend fusion via self-attention: We design a parameter-efficient and interpretable multi-source integration scheme. Each behavioral trend stream is encoded with an independent Bi-GRU (bi-directional GRU) and fused via a standard multi-head self-attention layer, achieving disciplined trend integration with a limited parameter budget.
Ablation-driven insights and comprehensive evaluation: Through controlled ablation experiments, we reveal an optimal number of trends: three semantically grounded trends generalize best, while additional trends induce overfitting. Beyond standard point accuracy, we employ a rolling-origin protocol and calibrated prediction intervals to better reflect deployment-time forecasting and uncertainty needs. In addition, we incorporate split conformal prediction as a principled uncertainty quantification component, providing calibrated prediction intervals with finite-sample coverage guarantees, which further enhances the practical utility of the framework in deployment-oriented settings.
4. Results
4.1. Main Results on AME (U.S.)
We first evaluate on the AME dataset under a time-respecting design consistent with the two-stage setup. Stage 1 (baseline) is trained only on the pre-pandemic segment (2018–2019) and then used to generate baseline predictions for the COVID-era period. For point-forecast comparison in Table 3 and Figure 3, we focus on the COVID-era segment (2020–2025) and apply a chronological 70/30 split within this segment to reflect realistic forward forecasting. The 70/30 split was chosen as a practical compromise between training sufficiency and evaluation reliability: the training portion remains long enough to learn stable seasonal and outbreak-related patterns, while the holdout portion is still sufficiently long for rolling out-of-sample assessment.
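The chronological split never shuffles observations, so the holdout always lies strictly in the future of the training data. A minimal sketch (the series length and function name are illustrative):

```python
import numpy as np

def chronological_split(y, train_frac=0.7):
    """Split a weekly series into a past (train) block and a future (test)
    block without shuffling, so evaluation mimics forward forecasting."""
    cut = int(len(y) * train_frac)
    return y[:cut], y[cut:]

y = np.arange(300, dtype=float)        # placeholder for COVID-era weekly rates
train, test = chronological_split(y)
```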
Table 3 reports the primary scale-free metrics (R² and MAPE). Our proposed model achieves competitive overall performance relative to the baseline models, reaching R² = 0.907 and reducing MAPE to 19.22%. Notably, the strongest single-stage baseline is TCN (MAPE 20.24%), while most other baselines exhibit negative R², indicating substantial difficulty in generalization under pandemic-era regime shifts.
Beyond point accuracy (and as further examined in the rolling-origin evaluation below), the two-stage design provides an operationally meaningful decomposition: Stage 1 learns a stable baseline from the pre-pandemic segment, and Stage 2 forecasts the excess component on top of that baseline. This separation improves robustness when the data distribution changes sharply (e.g., variant-driven waves and shifting admission/testing behaviors), since Stage 2 is no longer forced to simultaneously fit seasonal structure and shock-driven dynamics.
4.2. Robustness: Rolling Forecast Evaluation on AME
We further evaluate robustness using an expanding-window rolling-origin protocol that mimics a deployment setting where the model is repeatedly updated as new weekly observations arrive. We generate one-step-ahead forecasts on a fixed consecutive test window (13 August 2022 to 8 November 2025) and report both point accuracy and uncertainty calibration.
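The expanding-window loop can be sketched as follows, using a closed-form lag-based ridge regressor as a stand-in for any of the compared models; the function names and hyperparameters (`n_lags=6`, `lam`) are illustrative, not the exact experimental configuration.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression (illustrative stand-in for any model)."""
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ y)

def rolling_one_step(y, n_lags=6, test_len=52, lam=1.0):
    """Expanding-window rolling-origin evaluation: at each origin t,
    refit on all data up to t and forecast week t+1."""
    preds = []
    for t in range(len(y) - test_len, len(y)):
        hist = y[:t]
        # Lag matrix: row j holds hist[j..j+n_lags-1], target hist[j+n_lags].
        X = np.column_stack([hist[i:len(hist) - n_lags + i] for i in range(n_lags)])
        w = ridge_fit(X, hist[n_lags:], lam)
        preds.append(y[t - n_lags:t] @ w)   # one-step-ahead forecast
    return np.array(preds)

y = np.sin(2 * np.pi * np.arange(260) / 52) + 0.01 * np.arange(260)  # toy weekly series
preds = rolling_one_step(y, n_lags=6, test_len=52)
```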
Table 4 summarizes rolling metrics under the same test window, including RMSE/MAE and R², as well as empirical coverage and average interval widths for prediction intervals (PIs). For uncertainty quantification, we adopt split conformal calibration on a held-out calibration set and report 80% PIs for all methods; for our model we additionally report 95% PIs.
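The split conformal step can be sketched as below; the variable names and synthetic residuals are illustrative, but the quantile rule is the standard split-conformal construction with its finite-sample coverage guarantee under exchangeability.

```python
import numpy as np

def split_conformal_interval(resid_cal, y_hat_test, alpha=0.2):
    """Split conformal PI from absolute residuals on a calibration set.

    Uses the finite-sample quantile index ceil((n+1)(1-alpha)), which
    guarantees >= (1-alpha) marginal coverage under exchangeability.
    """
    n = len(resid_cal)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(np.abs(resid_cal))[min(k, n) - 1]   # conformal quantile
    return y_hat_test - q, y_hat_test + q

rng = np.random.default_rng(1)
resid_cal = rng.normal(scale=2.0, size=200)        # calibration residuals
y_true = rng.normal(size=500)
y_hat = y_true + rng.normal(scale=2.0, size=500)   # noisy point forecasts
lo, hi = split_conformal_interval(resid_cal, y_hat, alpha=0.2)
coverage = float(np.mean((y_true >= lo) & (y_true <= hi)))
```

On this synthetic setup the empirical coverage lands near the nominal 80%, mirroring the behavior reported in Table 4.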
Importantly, the rolling-origin results highlight a different trade-off from the fixed-split comparison in Table 3. On this particular test window, a simple autoregressive baseline (Ridge_L6) achieves the best one-step-ahead point accuracy, indicating that short-lag autoregression is highly competitive when the recent history is strongly predictive. In contrast, our two-stage model prioritizes a structured baseline–excess decomposition and provides calibrated uncertainty estimates via conformal prediction. As shown in Table 4, our prediction intervals achieve coverage close to or above the nominal levels (slightly conservative), which is valuable for risk-aware operational planning where decision-makers need reliable bounds rather than point forecasts alone.
Figure 4 compares two-stage + multi-trend (from-scratch updating) rolling trajectories against the best baseline under the same window.
Figure 5 further visualizes the coverage–width trade-off for the 80% prediction interval under the same rolling-window setting on AME. Specifically, it compares empirical coverage against average interval width, thereby showing how uncertainty calibration complements point accuracy in deployment-oriented evaluation. The figure indicates that different models balance calibration and sharpness differently, providing an additional probabilistic perspective beyond point-forecast error alone.
At the end of the rolling-origin evaluation, we additionally assessed the calibration of the conformal prediction intervals. Table 5 summarizes the empirical coverage and deviation from nominal levels, while Figure 6 visualizes the calibration across 10–95% nominal coverage levels.
The calibration plot indicates that the conformal prediction intervals are generally well-calibrated, with a mean absolute deviation of 0.065 from the nominal levels. At lower nominal levels (10–60%), the empirical coverage closely tracks the diagonal. At higher nominal levels (70–95%), the intervals are somewhat conservative, which is preferable for public health decision-making where under-coverage carries higher risk. Overall, the conformal framework provides reliable uncertainty quantification across the full range of coverage levels.
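The mean absolute deviation statistic can be computed as sketched below; the function name and the synthetic, self-calibrated intervals are illustrative, not our evaluation code.

```python
import numpy as np

def calibration_mad(y_true, lo_by_level, hi_by_level, levels):
    """Mean absolute deviation between empirical and nominal coverage
    across a grid of nominal levels (e.g., 10%..95%)."""
    devs = []
    for lvl, lo, hi in zip(levels, lo_by_level, hi_by_level):
        emp = np.mean((y_true >= lo) & (y_true <= hi))
        devs.append(abs(emp - lvl))
    return float(np.mean(devs))

rng = np.random.default_rng(2)
y_true = rng.normal(size=4000)
cal = np.abs(rng.normal(size=4000))          # calibration scores from the same law
levels = np.arange(0.10, 1.00, 0.05)         # 10%, 15%, ..., 95%
los, his = [], []
for lvl in levels:
    q = np.quantile(cal, lvl)                # level-specific interval half-width
    los.append(np.full_like(y_true, -q))
    his.append(np.full_like(y_true, q))
mad = calibration_mad(y_true, los, his, levels)
```

Because the intervals here are built from scores drawn from the same distribution as the targets, the MAD stays near zero; miscalibrated intervals would inflate it.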
4.3. Sensitivity of Stage-1 Baseline Length
We evaluated the effect of varying the Stage-1 pre-pandemic baseline length on the AME dataset and compared the resulting forecasts against simple seasonal baselines, including a seasonal naive baseline and a regression baseline with seasonal terms. Table 6 summarizes the performance metrics (RMSE, MAE, MAPE) for baseline lengths of 26, 35, and 44 weeks.
The results show that the two-stage framework consistently outperforms the simple seasonal baselines across all tested Stage-1 baseline lengths. Even the shortest pre-pandemic window (26 weeks) achieves substantially better accuracy, confirming that the predictive gains are driven primarily by the baseline–excess decomposition rather than by the absolute number of pre-pandemic weeks.
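The seasonal naive comparator used in this comparison (each forecast equals the observation one season earlier) can be sketched as follows; the function name and the toy series are illustrative.

```python
import numpy as np

def seasonal_naive_forecast(y, test_len, season=52):
    """Seasonal naive baseline: each test-week forecast is the observation
    exactly one season (52 weeks) earlier."""
    idx = np.arange(len(y) - test_len, len(y))
    return y[idx - season]

y = np.sin(2 * np.pi * np.arange(208) / 52)   # four perfectly periodic seasons
preds = seasonal_naive_forecast(y, test_len=52)
```

On a perfectly periodic series this baseline is exact; its errors on real data measure how far the post-2020 dynamics depart from pure seasonality.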
4.4. Ablation: How Many Trends to Fuse?
We conduct controlled ablations by varying the number of Google Trends streams fused in Stage 2. Table 7 and Figure 7 show a clear “sweet spot”: incorporating three trends yields the best generalization, while adding a fourth trend leads to severe overfitting (R² collapses and MAPE increases sharply). This result supports a practical takeaway for deployment: behavioral signals help when they are disciplined and complementary, but indiscriminately adding weak or redundant trends can destabilize forecasting.
We conducted a leave-one-feature-out ablation study across three chronological sub-periods (early: until June 2021; mid: until January 2023; late: until November 2025) to assess the temporal stability of selected Google Trends keywords for Stage-2 excess modeling.
Table 8 summarizes the impact of removing each keyword on forecast accuracy (RMSE, MAE, MAPE). The results indicate that the model’s reliance on individual Google Trends keywords varies across pandemic phases. In the mid and late periods, removing COVID-19 search trends generally yields the largest increase in RMSE relative to the full-feature baseline, confirming its importance as a leading indicator. All four keywords contribute meaningfully to overall forecast accuracy, validating the multi-source design and temporal stability of the selected keywords.
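The leave-one-feature-out protocol can be expressed generically as below; the ordinary-least-squares model and the feature names are illustrative stand-ins for the Stage-2 network and the actual keywords.

```python
import numpy as np

def lofo_rmse(X, y, fit, predict, names):
    """Leave-one-feature-out: refit without each column and report RMSE
    inflation relative to the full-feature model."""
    def rmse(cols):
        m = fit(X[:, cols], y)
        return float(np.sqrt(np.mean((predict(m, X[:, cols]) - y) ** 2)))
    full = rmse(list(range(X.shape[1])))
    deltas = {}
    for j, name in enumerate(names):
        cols = [c for c in range(X.shape[1]) if c != j]
        deltas[name] = rmse(cols) - full   # larger delta = more important feature
    return full, deltas

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
fit = lambda A, t: np.linalg.lstsq(A, t, rcond=None)[0]
predict = lambda w, A: A @ w
full, deltas = lofo_rmse(X, y, fit, predict, ["covid", "flu", "fever"])
```

By construction the first synthetic feature dominates, so removing it inflates RMSE the most, mirroring how removing the COVID-19 keyword hurts accuracy in the mid and late periods.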
4.5. Interpretability: Signal Timing and Dynamic Feature Focus
Beyond predictive accuracy, we provide interpretable and verifiable evidence on the temporal relationship between public search behavior (Google Trends) and weekly respiratory hospitalizations in AME. We focus on wave-level timing patterns and explicitly disclose the data-resolution constraints of retrospective Google Trends queries.
4.5.1. Key Constraint: Google Trends Is Monthly for Multi-Year Retrospective Queries
For the AME analysis spanning multiple years (2018–2025), Google Trends returns monthly observations due to retrospective aggregation. Consequently, peak timing comparisons involving Google Trends carry an inherent uncertainty of approximately ±2–4 weeks. In Figure 2, dashed lines denote linear interpolation used for visualization only; all peak-timing statements are based on the original monthly datapoints.
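The visualization-only interpolation can be reproduced as sketched below; the 4.345 weeks-per-month spacing and the function name are assumed approximations, not the exact plotting code.

```python
import numpy as np

def monthly_to_weekly(month_week_idx, month_vals, n_weeks):
    """Linearly interpolate monthly Trends values onto a weekly grid
    (for visualization only; peak-timing claims use the monthly points)."""
    weeks = np.arange(n_weeks)
    return np.interp(weeks, month_week_idx, month_vals)

months = np.arange(0, 104, 4.345)          # approx. week index of each month
vals = np.random.default_rng(4).uniform(0, 100, size=len(months))
weekly = monthly_to_weekly(months, vals, 104)
```

Linear interpolation never creates new extrema, which is why week-scale peak claims cannot be read off the interpolated curve.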
4.5.2. Real-Data Evidence: Wave-Level Alignment with Mixed Lead–Lag Across Keywords
Figure 8 (Delta) and Figure 9 (Omicron) compare weekly hospital RESP rates against monthly Google Trends signals. During the Delta wave, the hospital peak occurs on 4 September 2021, while fever searches peak on 1 September 2021 (approximately contemporaneous at monthly resolution). COVID-19 searches peak around 1 August 2021 (about one month earlier), whereas flu searches peak around 1 October 2021 (about one month later). During the Omicron wave, the hospital peak occurs on 8 January 2022, while fever and COVID-19 searches peak around 1 January 2022 (approximately contemporaneous within ±2–4 weeks), and flu searches peak around 1 December 2021 (about one month earlier). Overall, the figures suggest that search behavior broadly co-moves with hospital dynamics at the wave scale, but because the Google Trends signals are only available retrospectively at monthly resolution, these comparisons should be interpreted as qualitative evidence rather than as strong week-scale early-warning claims. We further summarize the peak timing differences between hospital RESP data and Google Trends in Table 9.
4.5.3. Conceptual Illustration: Dynamic Feature Focus in Attention-Based Fusion (Optional)
Figure 10 provides a conceptual illustration of how an attention-based fusion module may shift relative emphasis across signals across epidemic phases. This figure is included for intuition and is not produced by extracting attention weights from a trained model.
4.6. Europe: Failure Analysis, Evidence, and an Improvement Experiment
We next evaluate cross-region transferability on European surveillance series. In contrast to AME, most European country series begin around mid-2021, meaning that a fixed pandemic-onset split (e.g., 1 January 2020) leaves zero pre-pandemic baseline weeks. This violates the prerequisite of the two-stage design (Stage 1 requires a baseline segment), and explains why naive two-stage training can fail in Europe.
Failure (root cause): Under a fixed split date aligned to 2020, all European series have baseline length equal to 0, so Stage 1 cannot be trained as intended.
Evidence (why Stage 1 is still difficult even after restoring baseline): We implement an adaptive split strategy that selects a split date to ensure a non-trivial baseline segment (approximately 30–70% of available weeks, depending on the country). This restores 91–149 baseline weeks for most regions, but Stage 1 baseline learning remains challenging due to high noise and systematic bias (see Table 10). We quantify baseline quality using the coefficient of variation (CV), signal-to-noise ratio (SNR), and residual CV. In several countries, ridge-style baselines may yield superficially better in-sample fit but introduce large bias, inflating residual variance and propagating error into Stage 2.
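These diagnostics can be operationalized as below; the exact definitions (in particular residual CV taken as the residual standard deviation scaled by the series mean) are our illustrative choices rather than a fixed standard.

```python
import numpy as np

def baseline_quality(y, baseline):
    """Diagnostics used to screen Stage-1 baselines: series CV, SNR of the
    baseline fit, and CV of the residual passed on to Stage 2."""
    resid = y - baseline
    cv = np.std(y) / np.mean(y)                 # overall series variability
    snr = np.var(baseline) / np.var(resid)      # explained vs. leftover variance
    resid_cv = np.std(resid) / np.mean(y)       # residual noise, mean-scaled
    return cv, snr, resid_cv

rng = np.random.default_rng(5)
t = np.arange(156)
y = 10 + 3 * np.sin(2 * np.pi * t / 52) + rng.normal(scale=0.5, size=156)
good = 10 + 3 * np.sin(2 * np.pi * t / 52)      # captures the seasonal structure
flat = np.full(156, y.mean())                   # ignores seasonality entirely
cv, snr_good, rcv_good = baseline_quality(y, good)
_, snr_flat, rcv_flat = baseline_quality(y, flat)
```

A good baseline yields high SNR and low residual CV, leaving Stage 2 a cleaner excess series; a biased or flat baseline does the opposite.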
Improvement (what we changed and what it buys us): Based on the above evidence, we adopt (i) adaptive split to restore baseline length, and (ii) a simpler seasonal baseline choice to reduce bias propagation.
Table 11 reports the resulting European forecasting performance. While the method becomes usable in several settings (positive R² in multiple countries/regions), some countries remain difficult, consistent with shorter baselines (e.g., Hungary) and higher noise/bias (e.g., France), highlighting that data heterogeneity is a dominant bottleneck in Europe.
5. Discussion
5.1. Clinical and Operational Relevance
Forecasting hospitalization rates per 100,000 population supports operational dashboards that trigger timely situational awareness alerts and guide capacity adjustments. A key advantage of the proposed framework is its interpretability: Stage 1 provides a stable seasonal baseline, and Stage 2 estimates COVID-19-driven excess rates, which can be used to contextualize surges.
5.2. Complexity Analysis
The proposed Multi-Trend Cross-Attention framework is designed for computational efficiency. The total number of trainable parameters is approximately 12,718 under the default configuration. The time complexity of the trend encoders is O(T·L·H²), where L is the sequence length, T is the number of exogenous trend signals, and H is the hidden state dimension. The cross-attention mechanism operates across the trend dimension with a complexity of O(T²·H), independent of L. Consequently, the overall computational cost remains low, enabling rapid retraining and deployment on standard hardware without the need for high-performance computing clusters.
5.3. Why Baseline–Excess Decoupling Helps
Single-stage models must allocate capacity to simultaneously fit seasonal patterns and abrupt pandemic shocks. By separating these mechanisms, our approach reduces feature entanglement and focuses Stage 2 on non-stationary residual structure, which is better aligned with the epidemiological interpretation of COVID-19-driven excess.
5.4. Why Performance Drops in Some European Settings
European datasets often start later and may lack clean pre-pandemic baseline segments, weakening Stage 1 and propagating errors to the residual series. Small sample sizes further increase overfitting risk for attention-based models. These findings motivate transfer learning, automated split detection, and privacy-preserving cross-region learning (e.g., federated learning).
5.5. Limitations and Future Work
In this study, we use only Trends-based exogenous signals along with CDC hospitalization data, excluding other drivers such as policy interventions, vaccination, mobility, or climate. Accordingly, all performance claims are limited to the Trends-only setting. We focus on one-step forecasting, and future work will explore multi-horizon forecasting, uncertainty quantification, automated split learning, and incorporating richer exogenous features.
In summary, the fixed-split evaluation shows consistently positive performance of the proposed framework, while the rolling-origin evaluation indicates somewhat weaker results for point forecasting. This difference highlights the impact of sequential data arrival and evolving trends on model performance.
We note that the proposed model is most beneficial when a structured baseline–excess decomposition and calibrated uncertainty estimates are desired. In situations where simpler seasonal or autoregressive baselines already provide adequate performance, the incremental gain from the proposed framework may be limited. This guidance helps readers understand when the model is expected to provide practical advantages versus when simpler baselines may suffice.