1. Introduction
Forecasting respiratory virus-associated weekly hospitalization rates per 100,000 population is a core task in digital epidemiology and public health operations, supporting inpatient bed planning, ICU surge readiness, staffing, and supply-chain decisions [1,2,3]. Compared with case counts, hospitalization rates are typically more actionable for capacity management and less sensitive to changes in testing behavior [4,5]. In the United States, weekly hospitalization surveillance is routinely reported through CDC platforms such as RSV-NET, FluSurv-NET, and COVID-NET [6,7], while European summaries are disseminated through ECDC/WHO, enabling short-term early-warning forecasting for healthcare systems [8,9].
Despite recent progress in short-term forecasting for respiratory admissions, several gaps remain for operational hospitalization forecasting under post-pandemic conditions. Recent studies have shown that statistical and hierarchical models can provide useful subnational forecasts for influenza hospital admissions, while operational ensemble systems have also been deployed in real time for COVID-19, influenza, and RSV admissions and bed-occupancy forecasting [10,11]. At a finer operational scale, exogenous signals such as mobility and testing indicators have been shown to improve hospital-level COVID-19 admission forecasts, especially at longer lead times [12]. In parallel, multi-source forecasting studies have highlighted that jointly leveraging heterogeneous surveillance streams can be particularly valuable when the target hospitalization signal has only a short historical record [13].
However, three challenges remain insufficiently addressed for weekly respiratory hospitalization forecasting. First, the target often reflects a composite burden associated with co-circulating pathogens such as influenza, RSV, and SARS-CoV-2, rather than a single pathogen-specific process, which complicates model design and interpretation. Second, post-pandemic respiratory dynamics are characterized by pronounced regime shifts, making a single end-to-end model prone to entangling relatively stable seasonal structure with outbreak-driven excess dynamics. Third, although internet-based and behavioral data can provide timely supplementary information, recent reviews also emphasize that such digital signals are noisy, platform-dependent, and not always robust when incorporated naively [14].
Motivated by these observations, we argue that a practical forecasting framework for respiratory hospitalizations should satisfy three properties simultaneously: it should preserve a stable seasonal baseline, isolate outbreak-related excess fluctuations, and integrate only a small set of complementary exogenous signals in a disciplined way. To this end, we propose a two-stage framework that first learns the pre-pandemic baseline pattern and then models the excess component using multi-source search trends. This design is intended to improve robustness, interpretability, and sample efficiency in limited-data, regime-shift settings.
Concretely, the framework decouples baseline seasonality from COVID-19-induced excess hospitalizations and fuses a small set of search trends in a controlled manner. Stage 1 fits a lightweight unidirectional GRU on the pre-pandemic segment to estimate the baseline [15]. Stage 2 models the residual (excess) component: each search trend (flu, COVID-19, fever) is encoded by its own Bi-GRU (bi-directional GRU), in contrast to the unidirectional GRU of Stage 1, and the encodings are then fused by a standard multi-head self-attention layer that captures inter-trend dependencies [16,17]. We emphasize that composite outcome definitions can differ across surveillance systems (e.g., European composites can be constructed by summing pathogen-specific components, whereas U.S. RESP may represent a broader respiratory hospitalization burden); we therefore document outcome construction and alignment rules in the Methods section.
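To make the Stage-2 fusion step concrete, the sketch below applies scaled dot-product self-attention across per-trend encodings in NumPy. The function name `trend_self_attention`, the dimensions, and the single-head simplification (the actual model uses multi-head attention over Bi-GRU states) are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trend_self_attention(E, Wq, Wk, Wv):
    """Single-head self-attention across the trend axis.

    E: (T, H) matrix of per-trend encodings (e.g., final Bi-GRU states).
    Returns the fused (T, H) encodings and the (T, T) attention weights.
    """
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # inter-trend affinities
    return attn @ V, attn

rng = np.random.default_rng(0)
T, H = 3, 16                                   # three trends (flu, COVID-19, fever)
E = rng.normal(size=(T, H))
Wq, Wk, Wv = (rng.normal(size=(H, H)) for _ in range(3))
fused, attn = trend_self_attention(E, Wq, Wk, Wv)
```

Each row of `attn` is a convex weighting over the three trend streams, which is what makes the fusion inspectable.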
We conduct experiments on two distinct datasets. For the United States (AME), we use weekly hospitalization surveillance from CDC platforms including RSV-NET, FluSurv-NET, and COVID-NET [18,19,20], which are part of the CDC RESP-NET respiratory virus hospitalization surveillance infrastructure [21]. For Europe (EU), we use weekly respiratory virus surveillance released by ECDC/WHO Europe, specifically the non-sentinel severity dataset (nonSentinelSeverity.csv), in which the indicator hospitaladmissions is reported weekly by country. We additionally extract online search signals from Google Trends [22] as behavioral indicators that may lead hospitalization dynamics.
On the AME dataset, our two-stage multi-trend framework achieves substantial improvements over classical and modern baselines, providing a strong accuracy–efficiency trade-off, consistent with findings that carefully designed lightweight architectures can be competitive with heavier Transformer-style models for applied forecasting tasks [17,23]. Importantly, systematic trend ablations identify a robust “sweet spot”: three trends yield the best generalization, whereas adding an extra trend leads to overfitting, echoing prior observations that adding noisy behavioral signals can harm predictive performance in digital epidemiology [24,25]. These findings support a practical deployment message: mechanism decoupling and disciplined trend fusion can outperform heavier end-to-end architectures in limited-sample, regime-shift settings.
Although our experiments are set in the COVID-19 context, the proposed framework is designed for general outbreak-driven distribution shifts rather than a single disease: Stage 1 learns a stable pre-event baseline, and Stage 2 models the event-driven excess using timely exogenous signals. Consequently, the same pipeline can be retrained and deployed for future large-scale respiratory outbreaks or pandemics when historical baseline data and early-phase observations become available.
Our main contributions are as follows:
Deployment-oriented baseline-excess two-stage learning framework: Addressing distribution shifts induced by pandemics, we reformulate weekly respiratory hospitalization rate forecasting per 100,000 population during COVID-19 as a baseline–excess decomposition problem. In the first stage, a lightweight model learns stable pre-pandemic season patterns; the second stage focuses on the non-stationary excess component by integrating multi-source search trends. This design decouples complex dynamics and can be directly adapted to future respiratory outbreaks.
Lightweight multi-trend fusion via self-attention: We design a parameter-efficient and interpretable multi-source integration scheme. Each behavioral trend stream is encoded with an independent Bi-GRU (bi-directional GRU) and fused via a standard multi-head self-attention layer, achieving disciplined trend integration with a limited parameter budget.
Ablation-driven insights and comprehensive evaluation: Through controlled ablation experiments, we reveal an optimal number of trends: three semantically grounded trends generalize best, while additional trends induce overfitting. Beyond standard point accuracy, we employ a rolling-origin protocol and calibrated prediction intervals to better reflect deployment-time forecasting and uncertainty needs. In addition, we incorporate split conformal prediction as a principled uncertainty quantification component, providing calibrated prediction intervals with finite-sample coverage guarantees, which further enhances the practical utility of the framework in deployment-oriented settings.
4. Results
4.1. Main Results on AME (U.S.)
We first evaluate on the AME dataset under a time-respecting design consistent with the two-stage setup. Stage 1 (baseline) is trained only on the pre-pandemic segment (2018–2019) and then used to generate baseline predictions for the COVID-era period. For point-forecast comparison in Table 3 and Figure 3, we focus on the COVID-era segment (2020–2025) and apply a chronological 70/30 split within this segment to reflect realistic forward forecasting. The 70/30 split was chosen as a practical compromise between training sufficiency and evaluation reliability: the training portion remains long enough to learn stable seasonal and outbreak-related patterns, while the holdout portion is still sufficiently long for rolling out-of-sample assessment.
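The chronological split never shuffles observations, so the holdout always lies strictly in the future of the training data. A minimal sketch (the series length and function name are illustrative):

```python
import numpy as np

def chronological_split(y, train_frac=0.7):
    """Split a weekly series into a past (train) block and a future (test)
    block without shuffling, so evaluation mimics forward forecasting."""
    cut = int(len(y) * train_frac)
    return y[:cut], y[cut:]

y = np.arange(300, dtype=float)        # placeholder for COVID-era weekly rates
train, test = chronological_split(y)
```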
Table 3 reports the primary scale-free metrics (R² and MAPE). Our proposed model achieves competitive overall performance relative to the baseline models, reaching R² = 0.907 and reducing MAPE to 19.22%. Notably, the strongest single-stage baseline is TCN (MAPE 20.24%), while most other baselines exhibit negative R², indicating substantial difficulty in generalization under pandemic-era regime shifts.
Beyond point accuracy (and as further examined in the rolling-origin evaluation below), the two-stage design provides an operationally meaningful decomposition: Stage 1 learns a stable baseline from the pre-pandemic segment, and Stage 2 forecasts the excess component on top of that baseline. This separation improves robustness when the data distribution changes sharply (e.g., variant-driven waves and shifting admission/testing behaviors), since Stage 2 is no longer forced to simultaneously fit seasonal structure and shock-driven dynamics.
4.2. Robustness: Rolling Forecast Evaluation on AME
We further evaluate robustness using an expanding-window rolling-origin protocol that mimics a deployment setting where the model is repeatedly updated as new weekly observations arrive. We generate one-step-ahead forecasts on a fixed consecutive test window (13 August 2022 to 8 November 2025) and report both point accuracy and uncertainty calibration.
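The expanding-window loop can be sketched as follows, using a closed-form lag-based ridge regressor as a stand-in for any of the compared models; the function names and hyperparameters (`n_lags=6`, `lam`) are illustrative, not the exact experimental configuration.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression (illustrative stand-in for any model)."""
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ y)

def rolling_one_step(y, n_lags=6, test_len=52, lam=1.0):
    """Expanding-window rolling-origin evaluation: at each origin t,
    refit on all data up to t and forecast week t+1."""
    preds = []
    for t in range(len(y) - test_len, len(y)):
        hist = y[:t]
        # Lag matrix: row j holds hist[j..j+n_lags-1], target hist[j+n_lags].
        X = np.column_stack([hist[i:len(hist) - n_lags + i] for i in range(n_lags)])
        w = ridge_fit(X, hist[n_lags:], lam)
        preds.append(y[t - n_lags:t] @ w)   # one-step-ahead forecast
    return np.array(preds)

y = np.sin(2 * np.pi * np.arange(260) / 52) + 0.01 * np.arange(260)  # toy weekly series
preds = rolling_one_step(y, n_lags=6, test_len=52)
```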
Table 4 summarizes rolling metrics under the same test window, including RMSE/MAE and R², as well as empirical coverage and average interval widths for prediction intervals (PIs). For uncertainty quantification, we adopt split conformal calibration on a held-out calibration set and report 80% PIs for all methods; for our model we additionally report 95% PIs.
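The split conformal step can be sketched as below; the variable names and synthetic residuals are illustrative, but the quantile rule is the standard split-conformal construction with its finite-sample coverage guarantee under exchangeability.

```python
import numpy as np

def split_conformal_interval(resid_cal, y_hat_test, alpha=0.2):
    """Split conformal PI from absolute residuals on a calibration set.

    Uses the finite-sample quantile index ceil((n+1)(1-alpha)), which
    guarantees >= (1-alpha) marginal coverage under exchangeability.
    """
    n = len(resid_cal)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(np.abs(resid_cal))[min(k, n) - 1]   # conformal quantile
    return y_hat_test - q, y_hat_test + q

rng = np.random.default_rng(1)
resid_cal = rng.normal(scale=2.0, size=200)        # calibration residuals
y_true = rng.normal(size=500)
y_hat = y_true + rng.normal(scale=2.0, size=500)   # noisy point forecasts
lo, hi = split_conformal_interval(resid_cal, y_hat, alpha=0.2)
coverage = float(np.mean((y_true >= lo) & (y_true <= hi)))
```

On this synthetic setup the empirical coverage lands near the nominal 80%, mirroring the behavior reported in Table 4.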
Importantly, the rolling-origin results highlight a different trade-off from the fixed-split comparison in Table 3. On this particular test window, a simple autoregressive baseline (Ridge_L6) achieves the best one-step-ahead point accuracy, indicating that short-lag autoregression is highly competitive when the recent history is strongly predictive. In contrast, our two-stage model prioritizes a structured baseline–excess decomposition and provides calibrated uncertainty estimates via conformal prediction. As shown in Table 4, our prediction intervals achieve coverage close to or above the nominal levels (slightly conservative), which is valuable for risk-aware operational planning where decision-makers need reliable bounds rather than point forecasts alone.
Figure 4 compares two-stage + multi-trend (from-scratch updating) rolling trajectories against the best baseline under the same window.
Figure 5 further visualizes the coverage–width trade-off for the 80% prediction interval under the same rolling-window setting on AME. Specifically, it compares empirical coverage against average interval width, thereby showing how uncertainty calibration complements point accuracy in deployment-oriented evaluation. The figure indicates that different models balance calibration and sharpness differently, providing an additional probabilistic perspective beyond point-forecast error alone.
At the end of the rolling-origin evaluation, we additionally assessed the calibration of the conformal prediction intervals. Table 5 summarizes the empirical coverage and deviation from nominal levels, while Figure 6 visualizes the calibration across 10–95% nominal coverage levels.
The calibration plot indicates that the conformal prediction intervals are generally well-calibrated, with a mean absolute deviation of 0.065 from the nominal levels. At lower nominal levels (10–60%), the empirical coverage closely tracks the diagonal. At higher nominal levels (70–95%), the intervals are somewhat conservative, which is preferable for public health decision-making where under-coverage carries higher risk. Overall, the conformal framework provides reliable uncertainty quantification across the full range of coverage levels.
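The mean absolute deviation statistic can be computed as sketched below; the function name and the synthetic, self-calibrated intervals are illustrative, not our evaluation code.

```python
import numpy as np

def calibration_mad(y_true, lo_by_level, hi_by_level, levels):
    """Mean absolute deviation between empirical and nominal coverage
    across a grid of nominal levels (e.g., 10%..95%)."""
    devs = []
    for lvl, lo, hi in zip(levels, lo_by_level, hi_by_level):
        emp = np.mean((y_true >= lo) & (y_true <= hi))
        devs.append(abs(emp - lvl))
    return float(np.mean(devs))

rng = np.random.default_rng(2)
y_true = rng.normal(size=4000)
cal = np.abs(rng.normal(size=4000))          # calibration scores from the same law
levels = np.arange(0.10, 1.00, 0.05)         # 10%, 15%, ..., 95%
los, his = [], []
for lvl in levels:
    q = np.quantile(cal, lvl)                # level-specific interval half-width
    los.append(np.full_like(y_true, -q))
    his.append(np.full_like(y_true, q))
mad = calibration_mad(y_true, los, his, levels)
```

Because the intervals here are built from scores drawn from the same distribution as the targets, the MAD stays near zero; miscalibrated intervals would inflate it.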
4.3. Sensitivity of Stage-1 Baseline Length
We evaluated the effect of varying the Stage-1 pre-pandemic baseline length on the AME dataset and compared the resulting forecasts against simple seasonal baselines, including a seasonal naive baseline and a regression baseline with seasonal terms. Table 6 summarizes the performance metrics (RMSE, MAE, MAPE) for baseline lengths of 26, 35, and 44 weeks.
The results show that the two-stage framework consistently outperforms the simple seasonal baselines across all tested Stage-1 baseline lengths. Even the shortest pre-pandemic window (26 weeks) achieves substantially better accuracy, confirming that the predictive gains are driven primarily by the baseline–excess decomposition rather than by the absolute number of pre-pandemic weeks.
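The seasonal naive comparator used in this comparison (each forecast equals the observation one season earlier) can be sketched as follows; the function name and the toy series are illustrative.

```python
import numpy as np

def seasonal_naive_forecast(y, test_len, season=52):
    """Seasonal naive baseline: each test-week forecast is the observation
    exactly one season (52 weeks) earlier."""
    idx = np.arange(len(y) - test_len, len(y))
    return y[idx - season]

y = np.sin(2 * np.pi * np.arange(208) / 52)   # four perfectly periodic seasons
preds = seasonal_naive_forecast(y, test_len=52)
```

On a perfectly periodic series this baseline is exact; its errors on real data measure how far the post-2020 dynamics depart from pure seasonality.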
4.4. Ablation: How Many Trends to Fuse?
We conduct controlled ablations by varying the number of Google Trends streams fused in Stage 2. Table 7 and Figure 7 show a clear “sweet spot”: incorporating three trends yields the best generalization, while adding a fourth trend leads to severe overfitting (R² collapses and MAPE increases sharply). This result supports a practical takeaway for deployment: behavioral signals help when they are disciplined and complementary, but indiscriminately adding weak or redundant trends can destabilize forecasting.
We conducted a leave-one-feature-out ablation study across three chronological sub-periods (early: until June 2021; mid: until January 2023; late: until November 2025) to assess the temporal stability of selected Google Trends keywords for Stage-2 excess modeling.
Table 8 summarizes the impact of removing each keyword on forecast accuracy (RMSE, MAE, MAPE). The results indicate that the model’s reliance on individual Google Trends keywords varies across pandemic phases. In the mid and late periods, removing COVID-19 search trends generally yields the largest increase in RMSE relative to the full-feature baseline, confirming its importance as a leading indicator. All four keywords contribute meaningfully to overall forecast accuracy, validating the multi-source design and temporal stability of the selected keywords.
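The leave-one-feature-out protocol can be expressed generically as below; the ordinary-least-squares model and the feature names are illustrative stand-ins for the Stage-2 network and the actual keywords.

```python
import numpy as np

def lofo_rmse(X, y, fit, predict, names):
    """Leave-one-feature-out: refit without each column and report RMSE
    inflation relative to the full-feature model."""
    def rmse(cols):
        m = fit(X[:, cols], y)
        return float(np.sqrt(np.mean((predict(m, X[:, cols]) - y) ** 2)))
    full = rmse(list(range(X.shape[1])))
    deltas = {}
    for j, name in enumerate(names):
        cols = [c for c in range(X.shape[1]) if c != j]
        deltas[name] = rmse(cols) - full   # larger delta = more important feature
    return full, deltas

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
fit = lambda A, t: np.linalg.lstsq(A, t, rcond=None)[0]
predict = lambda w, A: A @ w
full, deltas = lofo_rmse(X, y, fit, predict, ["covid", "flu", "fever"])
```

By construction the first synthetic feature dominates, so removing it inflates RMSE the most, mirroring how removing the COVID-19 keyword hurts accuracy in the mid and late periods.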
4.5. Interpretability: Signal Timing and Dynamic Feature Focus
Beyond predictive accuracy, we provide interpretable and verifiable evidence on the temporal relationship between public search behavior (Google Trends) and weekly respiratory hospitalizations in AME. We focus on wave-level timing patterns and explicitly disclose the data-resolution constraints of retrospective Google Trends queries.
4.5.1. Key Constraint: Google Trends Is Monthly for Multi-Year Retrospective Queries
For the AME analysis spanning multiple years (2018–2025), Google Trends returns monthly observations due to retrospective aggregation. Consequently, peak timing comparisons involving Google Trends carry an inherent uncertainty of approximately ±2–4 weeks. In Figure 2, dashed lines denote linear interpolation used for visualization only; all peak-timing statements are based on the original monthly datapoints.
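The visualization-only interpolation can be reproduced as sketched below; the 4.345 weeks-per-month spacing and the function name are assumed approximations, not the exact plotting code.

```python
import numpy as np

def monthly_to_weekly(month_week_idx, month_vals, n_weeks):
    """Linearly interpolate monthly Trends values onto a weekly grid
    (for visualization only; peak-timing claims use the monthly points)."""
    weeks = np.arange(n_weeks)
    return np.interp(weeks, month_week_idx, month_vals)

months = np.arange(0, 104, 4.345)          # approx. week index of each month
vals = np.random.default_rng(4).uniform(0, 100, size=len(months))
weekly = monthly_to_weekly(months, vals, 104)
```

Linear interpolation never creates new extrema, which is why week-scale peak claims cannot be read off the interpolated curve.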
4.5.2. Real-Data Evidence: Wave-Level Alignment with Mixed Lead–Lag Across Keywords
Figure 8 (Delta) and Figure 9 (Omicron) compare weekly hospital RESP rates against monthly Google Trends signals. During the Delta wave, the hospital peak occurs on 4 September 2021, while fever searches peak on 1 September 2021 (approximately contemporaneous at monthly resolution). COVID-19 searches peak around 1 August 2021 (about one month earlier), whereas flu searches peak around 1 October 2021 (about one month later). During the Omicron wave, the hospital peak occurs on 8 January 2022, while fever and COVID-19 searches peak around 1 January 2022 (approximately contemporaneous within ±2–4 weeks), and flu searches peak around 1 December 2021 (about one month earlier). Overall, the figures suggest that search behavior broadly co-moves with hospital dynamics at the wave scale, but because the Google Trends signals are only available retrospectively at monthly resolution, these comparisons should be interpreted as qualitative evidence rather than as strong week-scale early-warning claims. We further summarize the peak timing differences between hospital RESP data and Google Trends in Table 9.
4.5.3. Conceptual Illustration: Dynamic Feature Focus in Attention-Based Fusion (Optional)
Figure 10 provides a conceptual illustration of how an attention-based fusion module may shift relative emphasis across signals across epidemic phases. This figure is included for intuition and is not produced by extracting attention weights from a trained model.
4.6. Europe: Failure Analysis, Evidence, and an Improvement Experiment
We next evaluate cross-region transferability on European surveillance series. In contrast to AME, most European country series begin around mid-2021, meaning that a fixed pandemic-onset split (e.g., 1 January 2020) leaves zero pre-pandemic baseline weeks. This violates the prerequisite of the two-stage design (Stage 1 requires a baseline segment), and explains why naive two-stage training can fail in Europe.
Failure (root cause): Under a fixed split date aligned to 2020, all European series have baseline length equal to 0, so Stage 1 cannot be trained as intended.
Evidence (why Stage 1 is still difficult even after restoring baseline): We implement an adaptive split strategy that selects a split date to ensure a non-trivial baseline segment (approximately 30–70% of available weeks, depending on the country). This restores 91–149 baseline weeks for most regions, but Stage 1 baseline learning remains challenging due to high noise and systematic bias (see Table 10). We quantify baseline quality using the coefficient of variation (CV), signal-to-noise ratio (SNR), and residual CV. In several countries, ridge-style baselines may yield superficially better in-sample fit but introduce large bias, inflating residual variance and propagating error into Stage 2.
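These diagnostics can be operationalized as below; the exact definitions (in particular residual CV taken as the residual standard deviation scaled by the series mean) are our illustrative choices rather than a fixed standard.

```python
import numpy as np

def baseline_quality(y, baseline):
    """Diagnostics used to screen Stage-1 baselines: series CV, SNR of the
    baseline fit, and CV of the residual passed on to Stage 2."""
    resid = y - baseline
    cv = np.std(y) / np.mean(y)                 # overall series variability
    snr = np.var(baseline) / np.var(resid)      # explained vs. leftover variance
    resid_cv = np.std(resid) / np.mean(y)       # residual noise, mean-scaled
    return cv, snr, resid_cv

rng = np.random.default_rng(5)
t = np.arange(156)
y = 10 + 3 * np.sin(2 * np.pi * t / 52) + rng.normal(scale=0.5, size=156)
good = 10 + 3 * np.sin(2 * np.pi * t / 52)      # captures the seasonal structure
flat = np.full(156, y.mean())                   # ignores seasonality entirely
cv, snr_good, rcv_good = baseline_quality(y, good)
_, snr_flat, rcv_flat = baseline_quality(y, flat)
```

A good baseline yields high SNR and low residual CV, leaving Stage 2 a cleaner excess series; a biased or flat baseline does the opposite.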
Improvement (what we changed and what it buys us): Based on the above evidence, we adopt (i) adaptive split to restore baseline length, and (ii) a simpler seasonal baseline choice to reduce bias propagation.
Table 11 reports the resulting European forecasting performance. While the method becomes usable in several settings (positive R² in multiple countries/regions), some countries remain difficult, consistent with shorter baselines (e.g., Hungary) and higher noise/bias (e.g., France), highlighting that data heterogeneity is a dominant bottleneck in Europe.
5. Discussion
5.1. Clinical and Operational Relevance
Forecasting hospitalization rates per 100,000 population supports operational dashboards that trigger timely situational awareness alerts and guide capacity adjustments. A key advantage of the proposed framework is its interpretability: Stage 1 provides a stable seasonal baseline, and Stage 2 estimates COVID-19-driven excess rates, which can be used to contextualize surges.
5.2. Complexity Analysis
The proposed Multi-Trend Cross-Attention framework is designed for computational efficiency. The total number of trainable parameters is approximately 12,718 under the default configuration. The time complexity of the trend encoders is O(T·L·H²), where L is the sequence length, T is the number of exogenous trend signals, and H is the hidden state dimension. The cross-attention mechanism operates across the trend dimension with a complexity of O(T²·H), independent of L. Consequently, the overall computational cost remains low, enabling rapid retraining and deployment on standard hardware without the need for high-performance computing clusters.
5.3. Why Baseline–Excess Decoupling Helps
Single-stage models must allocate capacity to simultaneously fit seasonal patterns and abrupt pandemic shocks. By separating these mechanisms, our approach reduces feature entanglement and focuses Stage 2 on non-stationary residual structure, which is better aligned with the epidemiological interpretation of COVID-19-driven excess.
5.4. Why Performance Drops in Some European Settings
European datasets often start later and may lack clean pre-pandemic baseline segments, weakening Stage 1 and propagating errors to the residual series. Small sample sizes further increase overfitting risk for attention-based models. These findings motivate transfer learning, automated split detection, and privacy-preserving cross-region learning (e.g., federated learning).
5.5. Limitations and Future Work
In this study, we use only Trends-based exogenous signals along with CDC hospitalization data, excluding other drivers such as policy interventions, vaccination, mobility, or climate. Accordingly, all performance claims are limited to the Trends-only setting. We focus on one-step forecasting, and future work will explore multi-horizon forecasting, uncertainty quantification, automated split learning, and incorporating richer exogenous features.
In summary, the fixed-split evaluation shows consistently positive performance of the proposed framework, while the rolling-origin evaluation indicates somewhat weaker results for point forecasting. This difference highlights the impact of sequential data arrival and evolving trends on model performance.
We note that the proposed model is most beneficial when a structured baseline–excess decomposition and calibrated uncertainty estimates are desired. In situations where simpler seasonal or autoregressive baselines already provide adequate performance, the incremental gain from the proposed framework may be limited. This guidance helps readers understand when the model is expected to provide practical advantages versus when simpler baselines may suffice.