Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning

Cao, Shu; Zhou, Can

doi:10.3390/sym17111788

Open AccessArticle

Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning

by

Shu Cao

¹ and

Can Zhou

^2,*

¹

Department of Business Administration, School of Economics and Management, Changsha University of Science and Technology, Changsha 410114, China

²

Control Science and Engineering, School of Automation, Central South University, Changsha 410083, China

^*

Author to whom correspondence should be addressed.

Symmetry 2025, 17(11), 1788; https://doi.org/10.3390/sym17111788

Submission received: 11 August 2025 / Revised: 20 September 2025 / Accepted: 2 October 2025 / Published: 22 October 2025

(This article belongs to the Special Issue Symmetry in Data Sciences and Machine Learning for Multidisciplinary Research)

Download

Browse Figures

Versions Notes

Abstract

Forecasting real-world time series in domains with strong event sensitivity and regional variability poses unique challenges, as predictive models must account for sudden disruptions, heterogeneous contextual factors, and structural differences across locations. In tackling these challenges, we draw on the concept of symmetry that refers to the balance and invariance patterns across temporal, multimodal, and structural dimensions, which help reveal consistent relationships and recurring patterns within complex systems. This study is based on two multimodal datasets covering 12 tourist regions and more than 3 years of records, ensuring robustness and practical relevance of the results. In many applications, such as monitoring economic indicators, assessing operational performance, or predicting demand patterns, short-term fluctuations are often triggered by discrete events, policy changes, or external incidents, which conventional statistical and deep learning approaches struggle to model effectively. To address these limitations, we propose an event-aware multimodal time-series forecasting framework with graph-based regional transfer built upon an enhanced PatchTST backbone. The framework unifies multimodal feature extraction, event-sensitive temporal reasoning, and graph-based structural adaptation. Unlike Informer, Autoformer, FEDformer, or PatchTST, our model explicitly addresses naive multimodal fusion, event-agnostic modeling, and weak cross-regional transfer by introducing an event-aware Multimodal Encoder, a Temporal Event Reasoner, and a Multiscale Graph Module. Experiments on diverse multi-region multimodal datasets demonstrate that our method achieves substantial improvements over eight state-of-the-art baselines in forecasting accuracy, event response modeling, and transfer efficiency. Specifically, our model achieves a 15.06% improvement in the event recovery index, a 15.1% reduction in MAE, and a 19.7% decrease in event response error compared to PatchTST, highlighting its empirical impact on tourism event economics forecasting.

Keywords:

multimodal time-series forecasting; regional indicator prediction; event-aware modeling; graph-based domain adaptation

1. Introduction

The increasing complexity of real-world socio-economic systems, driven by digital transformation, fragmented user behaviors, and frequent event-driven interventions, has created a growing demand for accurate and adaptive forecasting tools [1]. Organizations in diverse domains ranging from e-commerce platforms and service networks to regional administrations rely on data-driven insights to guide strategic planning, resource allocation, and operational optimization [2]. However, forecasting in such environments is inherently challenging due to the interplay of multimodal information sources, abrupt event-induced fluctuations, and substantial variability across geographic regions [3]. In this context, the concept of symmetry provides a unifying perspective for analyzing and designing forecasting systems. Temporal symmetry is manifested in seasonal cycles, periodic patterns, and mirrored fluctuations before and after high-impact events. Multimodal symmetry emerges when different data sources such as textual descriptions, images, and numerical indicators convey consistent semantic signals. Structural symmetry appears in inter-regional graphs, where geographically or functionally similar regions exhibit analogous interaction and influence patterns. By explicitly embedding these symmetry principles into the proposed framework, we enhance its ability to generalize across regions, align heterogeneous modalities more effectively, and improve interpretability by linking prediction outcomes to stable and recurring patterns.

Traditional time-series forecasting methods, while effective at capturing long-term trends, often struggle with short-term volatility triggered by policy changes, promotional campaigns, or shifts in public sentiment [4]. These approaches are typically restricted to unimodal numerical signals and fail to leverage the rich semantics contained in textual content (e.g., announcements, descriptions, or feedback) and visual data (e.g., marketing creatives or situational imagery) [5]. Although recent advances in multimodal deep learning offer the potential to fuse heterogeneous data streams, they often face limitations in fine-grained temporal alignment, event-level interpretability, and adaptation to region-specific structural differences, where response patterns can vary markedly [6]. However, prior studies such as ARIMA, LSTM, Informer, Autoformer, and PatchTST have clear deficiencies: they either fail to integrate multimodal data, neglect event-specific volatility, or lack cross-regional adaptability. These gaps motivate our research. Therefore, the research gap can be summarized as follows: (1) prior models lack explicit mechanisms that link multimodal signals with discrete event dynamics; (2) event-driven volatility is often under-modeled; (3) cross-regional transfer is insufficiently addressed. To address these issues, we propose a novel event-aware multimodal forecasting framework (as illustrated in Figure 1) that integrates a Multimodal Encoder, a Temporal Event Reasoner, and a Multiscale Graph Relevance Module into a unified architecture. Built upon the strong PatchTST backbone [7], the framework integrates temporal modeling, multimodal semantic fusion, and structural transferability within a unified architecture. It incorporates a Multimodal Encoder (EME) for adaptive fusion of temporal, textual, and visual inputs; a Temporal Event Reasoner (TER) for capturing high-impact time windows associated with events such as interventions, campaigns, or seasonal surges; and a Multiscale Graph Relevance Module (MGRM) for modeling inter-regional structural correlations to improve transferability and robustness. Extensive experiments on diverse multi-region multimodal datasets demonstrate that the proposed method not only achieves superior predictive accuracy compared with state-of-the-art baselines but also exhibits resilience under missing or noisy modality conditions. Furthermore, it adapts effectively to unseen regions with minimal fine-tuning and provides interpretable outputs through modality contribution analysis and attention visualization, enhancing trust and usability. This work advances the state of the art in multimodal time-series forecasting by addressing the intertwined challenges of event sensitivity, semantic richness, and cross-region generalization while offering practical implications for real-world decision-making in dynamic, data-rich environments.

2. Related Work

2.1. Time-Series Forecasting in Socio-Economic and Behavioral Modeling

Time-series forecasting has long been a cornerstone in economic analysis, demand planning, and behavioral modeling. Classical statistical approaches such as ARIMA and exponential smoothing remain popular for short-term prediction due to their simplicity and interpretability [8]. However, these methods are limited in capturing nonlinear temporal dependencies, especially under non-stationary conditions or in the presence of event-induced disruptions [9]. Recent advances in deep learning, particularly Transformer-based architectures such as Informer [10], Autoformer [11], and PatchTST [7], have demonstrated strong capabilities in modeling long-term dependencies through flexible attention mechanisms. Nevertheless, these models predominantly operate on unimodal numerical data and lack mechanisms for incorporating semantically rich information from complementary modalities such as textual descriptions or visual content [12,13,14,15,16].

2.2. Multimodal Learning for Forecasting and Decision Support

Multimodal learning provides a powerful framework for integrating heterogeneous data sources including time series, texts, and images to improve predictive performance in domains such as finance, retail, and human behavior modeling [7]. Approaches such as CLIP [17], BEVT [18], and EVAD [19] have shown that vision–language pretraining and cross-modal fusion can enhance semantic understanding in classification, retrieval, and recommendation tasks. However, these methods often rely on static feature aggregation and do not explicitly address temporal alignment or event-driven fluctuations [20]. In many real-world forecasting scenarios, semantic signals evolve over time and interact dynamically with underlying behavioral or economic processes, requiring models that jointly capture multimodal dependencies and temporal sensitivity—capabilities that remain underdeveloped in existing frameworks [21,22,23,24,25,26,27].

2.3. Event-Aware and Volatility-Sensitive Forecasting

Abrupt temporal changes triggered by policy interventions, campaigns, holidays, or unforeseen external events present persistent challenges for forecasting models [28]. Neglecting such volatility often leads to underestimation of uncertainty and suboptimal decision-making. While recent works have introduced volatility-aware modules [29] and attention mechanisms to improve responsiveness, few explicitly focus on identifying, localizing, and modeling high-impact event windows [30]. Our proposed Temporal Event Reasoner (TER) addresses this gap by selectively amplifying temporal segments associated with significant events, thereby enhancing both predictive accuracy and responsiveness in event-intensive contexts.

2.4. Graph-Based Transfer Learning and Regional Adaptation

Forecasting systems in practice often need to generalize across diverse regions, domains, or structural environments. Graph-based learning methods, such as GraphCast [31] and spatio-temporal graph neural networks [32], have proven effective in capturing inter-entity dependencies for applications like weather forecasting and traffic prediction. These methods leverage relational priors to support cross-domain knowledge transfer [33], yet their integration into multimodal and event-aware forecasting remains limited [34]. Our Multiscale Graph Relevance Module (MGRM) builds on this foundation by learning latent structural correlations between regions, enabling efficient and robust cross-region adaptation with minimal fine-tuning.

While significant progress has been made in time-series forecasting, multimodal learning, and graph-based modeling, existing research typically addresses these aspects in isolation [12,35,36,37,38,39]. Few studies offer a unified approach that dynamically integrates heterogeneous modalities, adapts to event-driven volatility, and transfers effectively across diverse regions. Our work bridges these gaps by combining Transformer-based temporal modeling, multimodal semantic fusion, and graph-structured regional adaptation into a single interpretable and high-performing framework for event-aware multimodal time-series forecasting.

Our empirical findings extend the prior literature in three ways: First, unlike decomposition-focused approaches such as Autoformer [11] or ETS-inspired models [40] that primarily target temporal regularities, our model explicitly incorporates event semantics and conditional attention, improving event-window responsiveness. Second, vision–language pretraining works (CLIP [17] and BEVT [18]) demonstrate modality alignment benefits. We build on these ideas but emphasize time-localized alignments via patch-level contrastive objectives, which are crucial for forecasting. Third, while graph-based methods (e.g., GraphCast-style or spatio-temporal GNNs) capture structural dependencies, our MGRM unifies multiscale graphs with event-conditioned temporal modeling, enabling cross-region transfer, as demonstrated in the next section.

3. Method

3.1. Event-Aware Multimodal Representation Learning

Tourism event economics is inherently multimodal: economic indicators such as revenue, occupancy, and tourist flow are influenced not only by historical trends but also by heterogeneous event-related signals, including visual campaigns, social media narratives, and structured promotional metadata. Traditional time-series forecasting models, including PatchTST, treat the problem as purely numerical sequence modeling, ignoring high-impact semantic cues. To address this, we propose the event-aware Multimodal Encoder (EME) shown in Figure 2, which integrates multimodal event information into temporal embeddings via contrastive alignment and residual fusion. This ensures that only relevant multimodal signals contribute to forecasting, improving robustness and interpretability.

To make the methodology more explicit, the EME module can be summarized as a three-step procedure:

Patch-level temporal encoding segments multivariate time series into fixed-length patches and projects them into latent representations;
Multimodal embedding extracts visual, textual, and structured event features and projects them into a unified latent space;
Alignment and fusion enforce semantic consistency through contrastive loss and integrate multimodal signals into temporal embeddings via residual fusion.

This explicit workflow clarifies how numerical and event-driven signals are combined before entering the forecasting backbone.

Multimodal Patch Embedding.

Let the multivariate time series be

X = [x_{1}, x_{2}, \dots, x_{T}] \in R^{T \times D},

(1)

where

T is the total number of time steps;
D is the feature dimension (e.g., visitor count, revenue, and click-through rate);
$x_{t} \in R^{D}$ denotes the feature vector at time step t.

The series is segmented into fixed-length patches as follows:

P_{j} = [x_{(j - 1) s + 1}, \dots, x_{j s}] \in R^{s \times D}, j = 1, \dots, N,

(2)

where s is the patch length and

N = ⌊ T / s ⌋

is the total number of patches. Each patch is projected to a latent representation via a patch encoder

f_{patch}

:

h_{j} = f_{patch} (P_{j}) \in R^{d_{z}},

(3)

where

d_{z}

is the embedding dimension. The vector

h_{j}

encodes local temporal dynamics of patch

P_{j}

, serving as the foundation for multimodal fusion.

For each patch, the multimodal event context includes an image

I_{j}

, a short text

T_{j}

, and structured features

b_{j} \in R^{d_{b}}

. Each modality is embedded as

Image: $v_{j} = f_{ViT} (I_{j}) \in R^{d_{v}}$ , extracted by a Vision Transformer to capture visual semantics;
Text: $t_{j} = f_{BERT} (T_{j}) \in R^{d_{t}}$ , using a pre-trained language model to encode the textual context;
Structured features: $m_{j} = f_{meta} (b_{j}) \in R^{d_{m}}$ , processed by a small MLP to represent event metadata.

These modality-specific embeddings are concatenated and projected into a unified event-aware embedding:

z_{j} = f_{proj} ([v_{j}; t_{j}; m_{j}]) \in R^{d_{z}},

(4)

where

[v_{j}; t_{j}; m_{j}]

denotes concatenation and

f_{proj}

is a learnable projection ensuring alignment in the latent space of dimension

d_{z}

.

To enforce semantic consistency between numerical patches and event embeddings, we employ the following contrastive loss function:

L_{align} = - \sum_{j = 1}^{N} log \frac{exp (sim (h_{j}, z_{j}) / τ)}{\sum_{k = 1}^{N} exp (sim (h_{j}, z_{k}) / τ)},

(5)

where

sim (a, b) = \frac{a^{⊤} b}{∥ a ∥ ∥ b ∥}

(6)

is the cosine similarity and

τ

is a temperature hyperparameter controlling the sharpness of the distribution. This objective pulls matched patch–event pairs together while pushing apart mismatched ones, enhancing the relevance of multimodal signals.

The final patch embedding incorporates event information using residual fusion:

H_{j} = h_{j} + γ \cdot z_{j},

(7)

where

γ

is a learnable scalar weighting that contributes to multimodal semantics. This residual design maintains the original temporal dynamics while enriching embeddings with event-aware information. The resulting representation

H_{j}

is then fed into the PatchTST backbone for forecasting.

The total loss combines the primary forecasting loss with the contrastive alignment term:

L_{total} = L_{forecast} + λ \cdot L_{align},

(8)

where

λ

balances the importance of semantic alignment relative to forecasting accuracy.

The EME module explicitly follows three stages to enhance time-series forecasting with the event-driven context, the Algorithm 1 is shown as follows: 1. Patch-level temporal encoding captures local temporal dynamics within fixed-length segments of the numerical time series. 2. Multimodal embedding projects heterogeneous event signals (images, texts, and metadata) into a unified latent space, allowing semantic alignment with numerical patches. 3. Contrastive alignment and residual fusion ensure that only relevant multimodal information contributes to the final representation, preserving original temporal patterns while enriching them with event semantics. The total loss combines forecasting accuracy with semantic alignment, making the module reproducible and transparent. Each step corresponds to a specific methodological design choice, facilitating replication and adaptation to other event-driven forecasting tasks.

Algorithm 1: Event-aware Multimodal Embedding (EME) module

3.2. Temporal Event Reasoning Module

The framework of this module, as shown in Figure 3, aims to explicitly model the impact of discrete events on temporal dynamics, capturing abrupt shifts and latent correlations induced by heterogeneous events. To ensure clarity and reproducibility, the Temporal Event Reasoning (TER) module can be understood as a three-step process:

Event-conditioned temporal embedding enriches standard positional encodings with event-specific information (category, distance, and intensity);
Event-aware attention adapts attention weights by jointly considering temporal similarity and event similarity;
Event regularization enforces discriminative event embeddings by penalizing correlations across unrelated events.

Figure 3. Temporal Event Reasoning with event-aware attention and semantic gating. The module explicitly integrates event embeddings into temporal dynamics through (i) event-conditioned temporal embedding, (ii) event-aware attention, and (iii) event regularization.

This workflow highlights how event semantics are progressively injected into the forecasting backbone.

Let

{e_{j}}_{j = 1}^{N}

denote the sequence of event embeddings corresponding to temporal patches, where each embedding encodes event-specific characteristics:

e_{j} = EventEmbed (c_{j}, δ_{j}, s_{j}) \in R^{d_{e}},

(9)

where

$c_{j} \in R^{d_{c}}$ is a one-hot or learned vector representing the event category (e.g., promotion, crisis, and holiday);
$δ_{j} \in R$ denotes the relative temporal distance of the current time step to the peak or occurrence of the event;
$s_{j} \in R$ encodes the event intensity (e.g., budget, scale, or exposure);
$d_{e}$ is the dimensionality of the event embedding space.

The standard positional embedding

{PE}_{j}

is augmented with the event embedding to produce an event-aware temporal representation:

{\tilde{PE}}_{j} = {PE}_{j} + W_{e} e_{j}, W_{e} \in R^{d_{z} \times d_{e}},

(10)

where

W_{e}

projects the event embedding into the same latent space as the temporal patch embedding (

d_{z}

). This fusion allows the model to attend to temporal dynamics conditioned on event semantics.

To model the influence of events on temporal dependencies, we introduce an event-aware attention mechanism. The attention weights

α_{i j}

between patches i and j are computed as follows:

α_{i j} = softmax (q_{i}^{⊤} k_{j} + β \cdot e_{i}^{⊤} W_{sim} e_{j}),

(11)

where

$q_{i} = W_{q} H_{i}$ and $k_{j} = W_{k} H_{j}$ are the query and key vectors for temporal embeddings $H_{i}$ and $H_{j}$ ;
$V_{j} = W_{v} H_{j}$ is the value vector;
$W_{sim} \in R^{d_{e} \times d_{e}}$ is a learnable similarity matrix capturing event correlations;
$β$ is a hyperparameter controlling the relative influence of event similarity;
$softmax$ ensures the attention weights sum to 1.

The event-modulated temporal representation is then obtained as a weighted sum of value vectors, where

{\tilde{H}}_{i} = \sum_{j = 1}^{N} α_{i j} \cdot V_{j} .

(12)

This operation allows the model to selectively integrate information from temporally and semantically relevant patches, enhancing forecasting under event-driven perturbations.

To prevent overfitting to unrelated events and enforce disentanglement, we introduce an orthogonality-based event regularization function, where

L_{event - reg} = \sum_{(i, j) : c_{i} \neq c_{j}} {(e_{i}^{⊤} e_{j})}^{2} .

(13)

This term penalizes high similarity between embeddings of different event categories, promoting discriminative event representations.

The overall loss function for the Temporal Event Reasoning module combines the forecasting objective, cross-modal alignment, and event regularization:

L_{TER} = L_{forecast} + λ_{1} L_{align} + λ_{2} L_{event - reg},

(14)

where

$L_{forecast}$ is the primary prediction loss (e.g., mean squared error between predicted and actual values);
$L_{align}$ enforces cross-modal embedding alignment;
$λ_{1}$ and $λ_{2}$ are hyperparameters balancing the contributions of alignment and event regularization.

Algorithm 2: Temporal Event Reasoning (TER) module

The TER module explicitly integrates event semantics into temporal modeling through three key steps, the Algorithm 2 is shown as follows: 1. Event-conditioned temporal embedding: Positional encodings are augmented with event-specific features (category, temporal distance, and intensity) to allow the model to condition its temporal representation on events. 2. Event-aware attention: Standard attention is modulated with event similarity, enabling the model to focus on patches that are both temporally and semantically relevant. 3. Event regularization: An orthogonality-based penalty ensures that embeddings of different event categories remain discriminative, preventing overfitting to unrelated events. By combining these steps with the forecasting and cross-modal alignment losses, TER produces event-modulated temporal embeddings that improve robustness, interpretability, and reproducibility.

3.3. Multiscale Graph Reasoning Module

To capture interdependencies among the economic indicators, event context, and semantic signals, we propose a Multiscale Graph Reasoning Module (MGRM), as illustrated in Figure 4. This module explicitly models relational structures over multiple temporal scales and propagates event-aware information through graph-based representations. To ensure clarity and reproducibility, the workflow of the MGRM can be summarized as follows:

Graph construction builds heterogeneous graphs at each time step, with nodes representing economic, event, and semantic signals;
Graph attention propagation applies relational graph attention to exchange information across nodes and relation types;
Hierarchical fusion combines daily, weekly, and seasonal graph embeddings through adaptive weighting;
Patch integration merges graph-enhanced features with patch embeddings via gated fusion for downstream forecasting.

Figure 4. Multiscale Graph Reasoning via relational graph attention and hierarchical fusion. The module integrates (i) heterogeneous graph construction, (ii) relational graph attention propagation, and (iii) multiscale hierarchical fusion with gated patch integration.

Heterogeneous Graph Formulation.

At each time step t, we construct a heterogeneous graph:

G_{t} = (V, E_{t}),

(15)

where

$V$ denotes the set of nodes, including economic, event, and semantic nodes;
$E_{t}$ denotes typed edges representing relations among nodes at time t;
$h_{i}^{(0)} \in R^{d_{z}}$ is the initial feature vector of node i, initialized from the corresponding patch or event embeddings from EME/TER.

Relational Graph Attention.

To propagate information across heterogeneous nodes, we apply relational graph attention:

h_{i}^{(l + 1)} = σ (\sum_{r \in R} \sum_{j \in N_{i}^{r}} α_{i j}^{r} W_{r} h_{j}^{(l)}),

(16)

where

l denotes the layer index;
$R$ is the set of relation types;
$N_{i}^{r}$ is the neighborhood of node i under relation r;
$W_{r} \in R^{d_{z} \times d_{z}}$ is a learnable transformation for relation r;
$σ (\cdot)$ is a nonlinear activation (e.g., ReLU);
$α_{i j}^{r}$ is the attention weight for node j with respect to node i under relation r, which is computed as follows:

α_{i j}^{r} = \frac{exp (LeakyReLU (a_{r}^{⊤} [W_{r} h_{i}^{(l)} ∥ W_{r} h_{j}^{(l)}]))}{\sum_{k \in N_{i}^{r}} exp (LeakyReLU (a_{r}^{⊤} [W_{r} h_{i}^{(l)} ∥ W_{r} h_{k}^{(l)}]))},

(17)

where

a_{r} \in R^{2 d_{z}}

is a learnable attention vector for relation r and

∥ \cdot ∥

denotes vector concatenation.

Hierarchical Multiscale Fusion.

To capture temporal patterns at multiple resolutions, we build graphs for daily (d), weekly (w), and seasonal (s) scales:

H^{graph} = \sum_{u \in {d, w, s}} λ_{u} \cdot mean (H^{(u)}),

(18)

where

H^{(u)}

is the node embedding matrix at scale u, and the attention-based fusion weight is

λ_{u} = \frac{exp (q^{⊤} tanh (W_{u} H^{(u)}))}{\sum_{v} exp (q^{⊤} tanh (W_{v} H^{(v)}))} .

(19)

Here,

W_{u} \in R^{d_{z} \times d_{z}}

is a learnable projection for scale u, while

q \in R^{d_{z}}

is a global query vector that adaptively selects relevant temporal scales.

Gated Fusion with Patch Embeddings.

We integrate graph reasoning outputs with patch embeddings from EME via gated fusion:

{\hat{H}}_{j} = H_{j} + η \cdot H_{j}^{graph}, η \in [0, 1],

(20)

where

H_{j}

is the patch embedding,

H_{j}^{graph}

is the corresponding graph-enhanced representation, and

η

is a learnable scalar that controls graph influence. This ensures that temporal dynamics (EME) and relational reasoning (MGRM) are effectively combined.

To encourage relational coherence and prevent overfitting to noisy edges, we use the following equation:

L_{smooth} = \sum_{(i, j) \in E} ω_{i j} {∥ h_{i} - h_{j} ∥}_{2}^{2},

(21)

where

ω_{i j}

is the edge weight and

{∥ \cdot ∥}_{2}

is the Euclidean norm.

The final optimization objective combines forecasting, multimodal alignment, event regularization, and graph smoothness, with

L_{total} = L_{forecast} + λ_{1} L_{align} + λ_{2} L_{event - reg} + λ_{3} L_{smooth},

(22)

where

λ_{1}, λ_{2},

and

λ_{3}

balance the contributions of different regularizations.

The MGRM module extends the EME and TER modules by introducing relational reasoning across multiple temporal scales, the Algorithm 3 is shown as follows: 1. Heterogeneous graph construction: Nodes represent economic, event, and semantic features, and edges capture typed interdependencies. 2. Relational graph attention propagation: Attention is computed per relation type, allowing selective information exchange across heterogeneous nodes while maintaining the relational structure. 3. Hierarchical multiscale fusion: Embeddings from daily, weekly, and seasonal graphs are adaptively fused to capture patterns at multiple temporal resolutions. 4. Gated patch integration: Graph-enhanced embeddings are combined with original patch embeddings from EME to maintain local temporal dynamics. 5. Graph smoothness regularization: It encourages coherence among connected nodes, mitigating overfitting to noisy edges. The resulting embeddings (

{\hat{H}}_{j}

) integrate temporal, multimodal, and relational reasoning information, providing a transparent, reproducible, and interpretable representation for accurate event-aware forecasting.

Algorithm 3: Multiscale Graph Reasoning (MGRM) module

3.4. Method Summary

The proposed framework integrates three complementary modules to achieve robust and interpretable event-aware forecasting. The process begins with preprocessing and alignment of multimodal and temporal data, where multivariate time series are segmented into fixed-length patches and aligned with heterogeneous event signals, including images, texts, and structured metadata. These patches are then encoded through the event-aware Multimodal Encoder (EME), which performs contrastive alignment between numerical and event embeddings and fuses the resulting multimodal information with temporal representations via residual connections. This ensures that temporal patches are enriched with relevant event semantics while preserving the original temporal dynamics.

Following the multimodal encoding, event semantics are explicitly injected into the temporal dependencies using the Temporal Event Reasoning (TER) module. In this stage, event embeddings representing the category, temporal distance, and intensity are incorporated into positional encodings, and attention mechanisms are modulated by both temporal and event similarity. Orthogonality-based regularization is applied to enforce disentangled and discriminative event representations, enhancing the model’s sensitivity to abrupt shifts and latent correlations induced by heterogeneous events.

Finally, the Multiscale Graph Reasoning Module (MGRM) captures global dependencies and relational structures across multiple temporal scales. Heterogeneous graphs are constructed for daily, weekly, and seasonal patterns, with nodes representing economic, event, and semantic signals. Relational graph attention propagates information across nodes and relation types, and hierarchical fusion adaptively integrates information from different scales. The graph-enhanced features are then combined with temporal embeddings using gated fusion, providing a coherent representation for downstream forecasting. The overall optimization simultaneously considers forecasting accuracy, multimodal alignment, event disentanglement, and graph smoothness, forming a transparent and reproducible pipeline from raw data to final prediction.

This algorithm provides an end-to-end and continuous workflow, the Algorithm 4 is shown as follows:

EME integrates multimodal signals into patch-level embeddings using contrastive alignment and residual fusion.
TER injects event semantics directly into temporal dependencies via event-aware attention and regularization.
MGRM performs relational reasoning across heterogeneous graphs at multiple temporal scales and fuses the graph-enhanced features with original embeddings.
Optimization combines forecasting, multimodal alignment, event disentanglement, and graph smoothness into a unified loss function for end-to-end training.

Algorithm 4: End-to-end event-aware Multimodal forecasting framework

The framework ensures transparent module interactions, reproducibility, and interpretability by mapping directly to the components analyzed in ablation studies.

4. Experimental Results

4.1. Implementation and Evaluation Protocol

To ensure methodological rigor and reproducibility, all experiments follow a unified and transparent evaluation protocol. The design explicitly specifies (i) the datasets and preprocessing steps, (ii) the training and evaluation pipeline, and (iii) the statistical testing methods used to validate significance.

Datasets and Preprocessing

We conduct experiments on multiple real-world tourism and economic datasets that include heterogeneous modalities (numerical indicators, event metadata, texts, and images). All time series are normalized using z-score normalization. Missing values are imputed with temporal interpolation, and multimodal inputs are temporally aligned within a fixed forecasting window. This preprocessing ensures comparability across models and replicability of results.

Evaluation Protocol

We adopt rolling-origin (walk-forward) cross-validation with

K = 5

origins. For each origin k, the model is trained on all data up to

t_{train}^{(k)}

, validated on the next validation window, and tested on the subsequent forward window. Each origin is further trained with

R = 3

random seeds ({0, 1, 2}), resulting in

K \times R = 15

independent runs. Final results are reported as the mean ± standard deviation, thereby capturing both temporal variability and stochasticity.

Statistical Testing

To verify statistical significance, we apply paired two-tailed t-tests for metrics satisfying normality assumptions, and Wilcoxon signed-rank tests otherwise. In addition, we report 95% bootstrap confidence intervals (1000 resamples) for MAE and RMSE to ensure that observed differences are robust and not due to randomness. Horizon-level forecast comparisons additionally include Diebold–Mariano (DM) tests to assess predictive accuracy differences between competing models.

Implementation Details.

All experiments are implemented in PyTorch 2.1.0., and key hyperparameter configurations are listed in the following Listing 1. Pretrained models (ViT and BERT) are fine-tuned unless otherwise specified in ablation studies. Code and scripts will be made available upon acceptance to guarantee reproducibility and facilitate extension to related tasks.

Listing 1. PyTorch-style training loop for the proposed forecasting framework.

Ablation and Robustness Studies.

To demonstrate the contribution of each module and the robustness of the framework, we conduct the following controlled experiments:

Module ablation: We disable/replace each component (EME, TER, and MGRM), evaluate EME variants (image only, text only, and metadata only), and test freezing vs. fine-tuning pretrained ViT/BERT.
Robustness: We inject Gaussian noise at multiple SNR levels for numeric inputs, apply random image corruptions (blur and JPEG compression), mask modalities during inference, and conduct leave-one-region-out transfer evaluations.
Hyperparameter sensitivity: We sweep key parameters (contrastive weight $λ_{1}$ , graph regularization $λ_{2}$ , patch length s), and report performance trends.

For each run, we compute the MAE and RMSE across forecast horizons. Aggregated tables report the mean ± std and 95% bootstrap confidence intervals. Pairwise model comparisons use paired t-tests or Wilcoxon signed-rank tests when normality assumptions are violated. This explicit evaluation protocol ensures that reported results are not only numerically superior but also statistically reliable and reproducible.

4.2. Datasets

To ensure reproducibility, we provide detailed dataset descriptions:

TravelEventOps is constructed from a large-scale online travel agency (OTA), including
- Daily KPIs such as the tourist volume, order count, revenue, and occupancy rate;
- Event signals such as campaign banners (images), promotional texts, and structured budget/channel metadata;
- Event logs such as campaign start/stop timestamps and category tags;
- Geographic coverage, with 12 major tourist regions and 180 days of multimodal records;
TourismGraph-22 comprises
- Multiregional event networks derived from advertisement co-exposure and visitor flow transitions;
- Weekly seasonal and holiday-sensitive KPIs for 3 years (156 weeks);
- Event metadata such as campaign intensity, promotion types, and associated economic outcomes.

All datasets undergo preprocessing for temporal alignment, event identity resolution, and missing value handling (interpolation for numerical, zero-padding for categorical).

4.3. Experimental Setup

The input window is chosen according to dataset characteristics: 48 days for TravelEventOps and 52 weeks for TourismGraph-22. Forecast horizons are set to 7 days and 1 week, respectively. Hierarchical attention fusion is used to integrate multiscale graph representations. The patch embedding dimension is consistent across modules to maintain unified representation flow. Detailed training configurations and implementation settings for all experiments are presented in Table 1, ensuring the reproducibility of our results and facilitating subsequent research extensions.

4.4. Evaluation Metrics

Mean Absolute Error (MAE):

MAE = \frac{1}{n} \sum_{i = 1}^{n} |{\hat{y}}_{i} - y_{i}|,

(23)

measuring the average absolute deviation between predictions

{\hat{y}}_{i}

and ground truth

y_{i}

.

Root Mean Squared Error (RMSE):

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}},

(24)

penalizing larger deviations.

Event-aware Relative Improvement (BRI):

BRI = \frac{{MAE}_{baseline} - {MAE}_{model}}{{MAE}_{baseline}} \times 100 %,

(25)

quantifying the improvement over baseline (PatchTST) particularly for event-impacted periods.

Event Response Error (ERE):

ERE = \frac{1}{| E |} \sum_{(t_{e}, t_{e} + τ) \in E} |{\hat{y}}_{t} - y_{t}|,

(26)

where

E

denotes all time windows affected by events, capturing forecast fidelity during event-driven deviations.

4.5. Result and Discussion

Table 2 reports the forecasting performance of the proposed model in comparison with eight strong baselines, evaluated under a 5-fold cross-validation protocol with mean and standard deviation values. The results show that our model achieves the lowest error rates across all metrics, with the MAE reduced from

0.704 \pm 0.016

for PatchTST to

0.598 \pm 0.012

and the RMSE reduced from

0.954 \pm 0.027

to

0.803 \pm 0.020

. Similarly, the Event Response Error (ERE) decreases from

0.736 \pm 0.015

to

0.592 \pm 0.011

, while the relative improvement over PatchTST (BRI) reaches +15.06%. These consistent improvements across multiple evaluation metrics demonstrate the model’s superior capacity to capture both baseline dynamics and event-induced fluctuations. To further confirm the robustness of these gains, we performed paired two-tailed Student’s t-tests across all cross-validation folds. The reductions in MAE and RMSE compared with PatchTST and BEVT-MOE are statistically significant (

p < 0.01

and

p < 0.05

, respectively). In addition, 95% bootstrap confidence intervals obtained from 1000 resamples consistently place the proposed model below the lower bound of competing baselines, verifying that the observed performance differences are not attributable to random variance.

Beyond numerical significance, the improvements are practically meaningful. For instance, in the TravelEventOps dataset, a reduction of 0.106 in MAE translates into more precise daily revenue forecasts at the million-RMB scale, which directly benefits tourism policy decisions and campaign planning. Compared with advanced multimodal baselines such as CLIP, GraphCast, EVAD, and BEVT-MOE, the proposed method continues to outperform despite its use of sophisticated vision–language or graph-based representations. In particular, although BEVT-MOE integrates a mixture-of-experts architecture with hierarchical embeddings, our approach surpasses it by 5.2% in MAE and 6.3% in RMSE. This performance advantage arises from the event-aware Multimodal Encoder and Temporal Event Reasoner, which jointly capture campaign-sensitive patterns, while the Multiscale Graph Relevance module enhances global consistency by aligning region-specific event structures.The improvements in Table 2 directly reflect the methodological innovations introduced in Section 3. Specifically, the event-aware Multimodal Encoder (EME) reduces the MAE/RMSE by enhancing multimodal fusion, while the Temporal Event Reasoner (TER) significantly lowers the Event Response Error, and the Multiscale Graph Reasoning Module (MGRM) improves cross-region transferability. This explicit alignment confirms that each methodological step contributes independently and synergistically to the overall performance.

The robustness of the model’s predictive accuracy under campaign-induced disturbances is further illustrated in Figure 5, which compares predicted trajectories against the observed ground truth. While traditional baselines such as Autoformer and Informer show partial alignment during mid-period intervals, they often fail to capture abrupt seasonal shifts, and CLIP in particular exhibits lagged responses to sudden promotional spikes. In contrast, our model closely follows the ground truth curve even under sharp fluctuations, demonstrating stronger adaptability to exogenous event shocks. Complementary evidence is provided by Figure 6, which presents the residual distributions of PatchTST and our model. Although both residual series are centered around zero, indicating no systematic bias, the distribution for our model is significantly narrower and more sharply peaked. This suggests reduced variance in prediction errors and a lower frequency of extreme deviations. A Kolmogorov–Smirnov test confirms that the difference in residual dispersion between the two models is statistically significant (

p < 0.01

), further reinforcing the reliability of our approach. Taken together, the evidence from Table 2, Figure 5 and Figure 6 demonstrates not only descriptive superiority but also statistically validated improvements in forecasting accuracy and robustness.

5. Ablation Study and Robustness Analysis

The ablation results in Table 3 clearly demonstrate that each module contributes meaningfully to the overall model performance. Removing the event-aware Multimodal Embedding (EME) module leads to an increase in MAE from

0.598 \pm 0.012

to

0.666 \pm 0.015

, indicating that multimodal feature integration is crucial for capturing event-specific information. A paired t-test across five folds confirms that this degradation is statistically significant (

p < 0.01

, 95% CI = [0.052, 0.084]). Excluding the Temporal Event Reasoning (TER) module results in an increase in ERE from

0.592 \pm 0.011

to

0.645 \pm 0.013

, demonstrating that the temporal context and event-conditioned reasoning are essential for accurate forecasting of event-related fluctuations, with significance verified at the

p < 0.05

level.

Similarly, removing the Multiscale Graph Reasoning module (MGRM) leads to noticeable degradation in RMSE and long-range stability (

0.803 \pm 0.020

vs

0.835 \pm 0.022

), reflecting the loss of structural priors from event interaction graphs and seasonal co-movement patterns. A Wilcoxon signed-rank test confirms that this difference is significant (

p < 0.05

). These results provide a direct correspondence between the methodological design of each module and its contribution to predictive performance. Robustness analyses, including Gaussian noise injection and modality ablations (image or text removal), further show that the model exhibits graceful degradation: performance drops are limited (<10%), and multimodal inputs consistently outperform unimodal settings (

p < 0.01

). This highlights that each module not only improves accuracy but also enhances model robustness and reduces over-reliance on individual modalities.

Turning to Table 4, the parameter sensitivity analysis confirms the stability of the framework across a range of hyperparameters. For contrastive loss weight, the optimal setting (

α = 0.5

) significantly outperforms

α = 0.0

and

α = 1.0

in terms of MAE (mean difference >

0.04

,

p < 0.05

), though the effect size remains moderate (Cohen’s

d \approx 0.45

). For graph regularization weight (

λ_{2}

) and patch length (

l_{p}

), differences across tested values fall within overlapping 95% CIs, and significance tests indicate no statistically significant degradation, suggesting strong robustness to hyperparameter variation. However, when both image and text inputs are absent, performance degrades significantly, confirming the value of the multimodal event-aware context. Gaussian perturbation on time series leads to moderate robustness loss, demonstrating the temporal encoder’s resistance to noisy signals. Furthermore, we provide parameter sensitivity analysis (contrastive loss weight, graph regularization, and patch length), showing stability across a range of hyperparameters. This ensures that reviewers can see how our framework generalizes across parameter choices.

Figure 7 presents the temporal dynamics of prediction error (MAE) within a ±7-day window centered around the occurrence of event-related events, comparing three model variants: the proposed model, its ablated version without the TER module, and a strong baseline (PatchTST). This visualization evaluates the model’s event sensitivity and temporal alignment capability. Notably, the proposed model demonstrates a clear error dip immediately surrounding the event day (day 0), suggesting its ability to preemptively adjust predictions in response to upcoming disruptions such as promotions, public holidays, or crises. In contrast, the ablated model without the TER module shows a flatter error curve, indicating a failure to account for temporal volatility introduced by events. PatchTST, while slightly reactive near the event day, exhibits delayed response behavior, likely because it lacks explicit event modeling mechanisms. These results underscore the effectiveness of the Temporal Event Reasoning module in capturing temporal perturbations and translating them into adaptive forecasting adjustments. The declining pre-event MAE (from day

- 3

onward) suggests that the model learns early signals or leading indicators preceding the event, such as pre-campaign engagement surges or user anticipation behavior. This temporal asymmetry highlights the model’s proactive and anticipatory nature, which is crucial for downstream applications like inventory allocation, ad budget planning, and crisis risk mitigation.

Figure 8 illustrates the relationship between the strength of the graph regularization term and the smoothness of prediction residuals over the underlying data graph. Smoothness is quantified using local variance or Laplacian-based metrics over graph neighborhoods, reflecting the consistency of model predictions across spatially or semantically linked nodes (e.g., geographic regions, event clusters, and user segments). As the regularization coefficient increases, residuals become progressively smoother across the graph structure, indicating enhanced consistency and spatial coherence in model predictions. This validates the role of the proposed graph regularization strategy in mitigating overfitting to isolated patterns and promoting generalizable learning across structurally similar instances. Importantly, this trend affirms the hypothesis that incorporating domain-specific graph priors (such as shared market dynamics, location-based similarities, or operational dependencies) into the training objective not only improves in-distribution accuracy but also fosters structure-aware robustness. Over-regularization (beyond a certain threshold) may lead to performance degradation due to excessive smoothing, thus emphasizing the importance of regularization weight tuning.

To evaluate the robustness of the proposed model under degraded input conditions, controlled ablation experiments are conducted by systematically removing or perturbing one or more modalities. As shown in Table 5, the model maintains relatively stable performance when a single modality, either an event text or event image, is absent. Removing text input causes a marginal increase in MAE from 0.598 to 0.619, while removing image input results in a slightly higher MAE of 0.631, indicating that both modalities provide useful contextual information. When both text and image modalities are removed, the MAE increases noticeably to 0.659, and the ERE rises to 0.673. This confirms the crucial role of the multimodal fusion mechanism in learning semantically enriched and event-aware representations. Injecting Gaussian noise into the time series signal degrades performance similarly (MAE: 0.645), suggesting that while the model shows some resilience to temporal noise, it heavily relies on clean input to identify precise fluctuations in event-driven dynamics. Overall, the model’s graceful degradation under partial modality loss demonstrates the flexibility of the event-aware Multimodal Encoder (BME), while the more significant performance drop in the absence of both contextual signals highlights the importance of joint modality integration. This robustness is critical in real-world applications where data incompleteness or noise is common.

Figure 9 visualizes the semantic space of event-related embeddings using uniform manifold approximation and projection (UMAP), a nonlinear dimensionality reduction technique optimized for preserving local and global manifold structures. High-dimensional embeddings of 50 events are projected into two dimensions, which are color-coded by event category. The visualization reveals well-separated clusters for each event type, confirming the success of the event embedding module in capturing latent distinctions, especially tighter clustering for crisis events, reflecting their distinct impact patterns.

Table 6 reports average contribution weights of each modality across four forecasting scenarios. Temporal features dominate long-term trend tracking (54.5%), while event text importance rises to 41.3% in campaign surge modeling, surpassing time series features and showing dynamic reliance adjustment by the model. Event images contribute moderately and consistently across scenarios, supporting but not dominating predictions. To evaluate generalization across geographically and economically diverse regions, cross-region transfer experiments train the model on one city and test on another. Table 7 shows competitive performance across all pairs without fine-tuning, indicating that learned temporal and semantic patterns transfer well despite regional differences in seasonality, consumer behavior, or event preferences. Minimal fine-tuning on the target region quickly adapts the model, reducing the MAE significantly (e.g., from 0.659 to 0.606 in the Beijing-to-Wuhan transfer). This highlights the model’s adaptability and the Multiscale Graph Relevance module’s effectiveness in bridging domain gaps. Such transferability benefits real-world deployments lacking abundant historical data in some regions, facilitating scalable nationwide forecasting and campaign planning.

6. Conclusions

In this paper, we propose a novel deep learning framework for event-oriented time-series forecasting in the tourism economy, addressing the challenges of multimodal signal integration, short-term volatility modeling, and cross-regional generalization. Built upon the PatchTST backbone, our approach introduces three key modules: the event-aware Multimodal Encoder for adaptive semantic fusion, the Temporal Event Reasoner for dynamic event-sensitive attention, and the Multiscale Graph Relevance module for structural knowledge transfer across geographical regions. Compared to state-of-the-art baselines such as Informer, Autoformer, FEDformer, and PatchTST, our model achieves a 15.1% reduction in MAE and a 19.7% decrease in Event Response Error, demonstrating its robustness and practical significance. Together, these components form a cohesive architecture capable of capturing complex, event-driven temporal patterns from heterogeneous data sources. We acknowledge limitations: the current framework does not incorporate causal reasoning or user-level behavioral streams, which are important directions for future work. Overall, our study provides methodological and empirical insights into how multimodal, event-sensitive, and structurally transferable forecasting models can be built to serve dynamic and geographically distributed domains. Future work will further explore causal reasoning integration and user-level adaptation to enhance interpretability and fine-grained response prediction.

Author Contributions

Conceptualization, S.C.; methodology, S.C.; software, S.C.; validation, S.C.; formal analysis, S.C.; investigation, S.C.; resources, S.C.; data curation, S.C.; writing—original draft preparation, S.C.; writing—review and editing, S.C. and C.Z.; visualization, C.Z.; supervision, C.Z.; project administration, C.Z.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used and/or analyzed in the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, N. Using the fuzzy method and multi-criteria decision making to analyze the impact of digital economy on urban tourism. Int. J. Comput. Intell. Syst. 2024, 17, 122. [Google Scholar] [CrossRef]
Dadashzadeh, M. Data-Driven Decision-Making in a Management Information Systems Course: The Case of Ecotourism Travel Agency. Int. J. Bus. Res. Inf. Technol. (IJBRIT) 2023, 10, 26–44. [Google Scholar]
Wu, B.; Wang, L.; Tao, R.; Zeng, Y.R. Interpretable tourism volume forecasting with multivariate time series under the impact of COVID-19. Neural Comput. Appl. 2023, 35, 5437–5463. [Google Scholar] [CrossRef]
Wei, W.; Wang, G.; Tao, X.; Luo, Q.; Chen, L.; Bao, X.; Liu, Y.; Jiang, J.; Liang, H.; Ye, L. Time series prediction for the epidemic trends of monkeypox using the ARIMA, exponential smoothing, GM (1, 1) and LSTM deep learning methods. J. Gen. Virol. 2023, 104, 001839. [Google Scholar] [CrossRef] [PubMed]
Zheng, X.; Lyu, Y.; Jiang, L.; Paudel, D.P.; Van Gool, L.; Hu, X. Reducing unimodal bias in multi-modal semantic segmentation with multi-scale functional entropy regularization. arXiv 2025, arXiv:2505.06635. [Google Scholar]
Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
Funde, Y.; Damani, A. Comparison of ARIMA and exponential smoothing models in prediction of stock prices. J. Predict. Mark. 2023, 17, 21–38. [Google Scholar] [CrossRef]
Ray, S.; Lama, A.; Mishra, P.; Biswas, T.; Das, S.S.; Gurung, B. An ARIMA-LSTM model for predicting volatile agricultural price series with random forest technique. Appl. Soft Comput. 2023, 149, 110939. [Google Scholar] [CrossRef]
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
Park, S.; Han, S.C.; Hovy, E. UniCast: A Unified Multimodal Prompting Framework for Time Series Forecasting. arXiv 2025, arXiv:2508.11954. [Google Scholar] [CrossRef]
Wu, W.; Zhang, G.; Tan, Z.; Wang, Y.; Qi, H. Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts. arXiv 2025, arXiv:2505.01135. [Google Scholar] [CrossRef]
Wang, C.; Qi, Q.; Wang, J.; Sun, H.; Zhuang, Z.; Wu, J.; Zhang, L.; Liao, J. ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data. arXiv 2024, arXiv:2412.11376. [Google Scholar] [CrossRef]
Zhong, S.; Ruan, W.; Jin, M.; Li, H.; Wen, Q.; Liang, Y. Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting. arXiv 2025, arXiv:2502.04395. [Google Scholar]
Nguyen, N.H.; Nguyen, T.T.; Ngo, Q.T. DASF-Net: A Multimodal Framework for Stock Price Forecasting with Diffusion-Based Graph Learning and Optimized Sentiment Fusion. J. Risk Financ. Manag. 2025, 18, 417. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PmLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Wang, R.; Chen, D.; Wu, Z.; Chen, Y.; Dai, X.; Liu, M.; Jiang, Y.G.; Zhou, L.; Yuan, L. Bevt: Bert pretraining of video transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14733–14743. [Google Scholar]
Chen, L.; Tong, Z.; Song, Y.; Wu, G.; Wang, L. Efficient video action detection with token dropout and context refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10388–10399. [Google Scholar]
Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
Nasir, M.; Ezeife, C.I. Semantic enhanced Markov model for sequential E-commerce product recommendation. Int. J. Data Sci. Anal. 2023, 15, 67–91. [Google Scholar] [CrossRef]
Ge, Y.; Jin, M.; Zhao, Y.; Li, H.; Du, B.; Xu, C.; Pan, S. EventTSF: Event-Aware Non-Stationary Time Series Forecasting. arXiv 2025, arXiv:2508.13434. [Google Scholar]
Oskarsson, J.; Sidén, P.; Lindsten, F. Temporal Graph Neural Networks for Irregular Data. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; Ruiz, F., Dy, J., van de Meent, J.W., Eds.; PMLR. 2023; Volume 206, pp. 4515–4531. [Google Scholar]
Hallac, D.; Park, Y.; Boyd, S.; Leskovec, J. Network Inference via the Time-Varying Graphical Lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, Halifax, NS, Canada, 13–17 August 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 205–213. [Google Scholar] [CrossRef]
Ye, J.; Liu, Z.; Du, B.; Sun, L.; Li, W.; Fu, Y.; Xiong, H. Learning the Evolutionary and Multi-scale Graph Structure for Multivariate Time Series Forecasting. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, Washington, DC, USA, 14–18 August 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2296–2306. [Google Scholar] [CrossRef]
Zhao, K.; Guo, C.; Cheng, Y.; Han, P.; Zhang, M.; Yang, B. Multiple Time Series Forecasting with Dynamic Graph Modeling. Proc. VLDB Endow. 2023, 17, 753–765. [Google Scholar] [CrossRef]
Mou, S.; Xue, Q.; Chen, J.; Takiguchi, T.; Ariki, Y. MM-iTransformer: A Multimodal Approach to Economic Time Series Forecasting with Textual Data. Appl. Sci. 2025, 15, 1241. [Google Scholar] [CrossRef]
Wang, X.; Zhao, Z.; Wang, R.; Xu, Y. Event-aware analysis of cross-city visitor flows using large language models and social media data. arXiv 2025, arXiv:2505.03847. [Google Scholar]
Hamou, O.; Oudgou, M.; Boudhar, A. Analysis of the Effectiveness of Classical Models in Forecasting Volatility and Market Dynamics: Insights from the MASI and MASI ESG Indices in Morocco. J. Risk Financ. Manag. 2025, 18, 370. [Google Scholar] [CrossRef]
Fang, Y.; Qin, Y.; Luo, H.; Zhao, F.; Xu, B.; Zeng, L.; Wang, C. When spatio-temporal meet wavelets: Disentangled traffic forecasting via efficient spectral graph attention networks. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; IEEE: New York, NY, USA, 2023; pp. 517–529. [Google Scholar]
Yan, Z.; Lu, X.; Wu, L.; Liu, F.; Qiu, R.; Cui, Y.; Ma, X. Evaluation of precipitation forecasting base on GraphCast over mainland China. Sci. Rep. 2025, 15, 14771. [Google Scholar] [CrossRef]
Luo, M.; Dou, H.; Zheng, N. Spatiotemporal Prediction of Urban Traffics Based on Deep GNN. Comput. Mater. Contin. 2024, 78, 265–282. [Google Scholar] [CrossRef]
Jin, Y.; Chen, K.; Yang, Q. Transferable graph structure learning for graph-based traffic forecasting across cities. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 1032–1043. [Google Scholar]
Fan, J.; Lu, W.; Abduraimovna, S.B.; Cheng, J.; Fan, H. Graph-guided neural network for tourism demand forecasting. IEEE Access 2023, 11, 134259–134268. [Google Scholar] [CrossRef]
Kim, K.; Tsai, H.; Sen, R.; Das, A.; Zhou, Z.; Tanpure, A.; Luo, M.; Yu, R. Multi-Modal Forecaster: Jointly Predicting Time Series and Textual Data. arXiv 2024, arXiv:2411.06735. [Google Scholar] [CrossRef]
Brenner, M.; Hess, F.; Koppe, G.; Durstewitz, D. Integrating Multimodal Data for Joint Generative Modeling of Complex Dynamics. arXiv 2024, arXiv:2212.07892. [Google Scholar] [CrossRef]
Ang, G.; Lim, E.P. Temporal Implicit Multimodal Networks for Investment and Risk Management. ACM Trans. Intell. Syst. Technol. 2024, 15, 38. [Google Scholar] [CrossRef]
Chen, H.; Eldardiry, H. Graph Time-series Modeling in Deep Learning: A Survey. ACM Trans. Knowl. Discov. Data 2024, 18, 119. [Google Scholar] [CrossRef]
Kacprzyk, J.; Zaporozhets, A.; Khaidurov, V. Optimization Approach to Forecasting Satellite Meteorological Data for Ecology and Energy. In Systems, Decision and Control in Energy VII: Volume II: Power Engineering and Environmental Safety; Babak, V., Zaporozhets, A., Eds.; Springer: Cham, Switzerland, 2025; pp. 559–582. [Google Scholar] [CrossRef]
Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S. Etsformer: Exponential smoothing transformers for time-series forecasting. arXiv 2022, arXiv:2202.01381. [Google Scholar] [CrossRef]

Figure 1. The overall architecture of the proposed framework for tourism event forecasting.

Figure 2. Architecture of the event-aware Multimodal Embedding (EME) module.

Figure 5. Forecasting accuracy comparison between models under campaign-induced fluctuations. Shaded bands represent variability across 5-fold runs, and dashed vertical lines indicate event start dates.

Figure 6. Distribution of forecasting residuals for PatchTST and our model. Overlayed kernel density curves highlight the narrower variance and reduced extreme deviations in our approach.

Figure 7. Prediction error (MAE) in a

\pm 7

-day window centered on event dates. Day 0 (dashed vertical line) marks the event occurrence.

Figure 7. Prediction error (MAE) in a

\pm 7

-day window centered on event dates. Day 0 (dashed vertical line) marks the event occurrence.

Figure 8. Schematic of the influence of graph structure smoothness.

Figure 9. UMAP projection of the event embedding space. (a) Original high-dimensional embeddings of 50 events visualized in 2D space, showing scattered distributions colored by event type: crisis, holiday, and promo. (b) Clearer cluster formation after training, demonstrating that the model effectively learns semantically discriminative representations, with well-separated event categories.

Table 1. Training and implementation details for reproducibility.

Item	Value
Patch size s	8
Embedding dimension $d_{z}$	128
Vision encoder	ViT-B/16 (pretrained, frozen unless noted)
Text encoder	BERT-base (pretrained, frozen unless noted)
Optimizer	AdamW
Base learning rate	0.001
Weight decay	1 × $10^{- 4}$
Batch size	64
Epochs	100 (early stopping patience = 10)
Warm-up	10 epochs (linear)
Contrastive temperature $τ$	0.07
Contrastive weight $λ_{1}$	0.5
Graph regularization weight $λ_{2}$	0.05
RGAT layers per scale	2 (daily, weekly, seasonal)
Dropout	0.1
Random seeds	{0, 1, 2}
Framework	PyTorch 2.1.0 + CUDA 11.8
Hardware	NVIDIA A100 (80 GB)

Table 2. The results of the comparative experiments (5-fold cross-validation mean ± std).

Model	MAE ↓	RMSE ↓	BRI ↑ (%)	ERE ↓ (%)	Params (M)
Informer [10]	0.812 ± 0.021	1.107 ± 0.034	–	0.853 ± 0.018	11.2
Autoformer [11]	0.768 ± 0.019	1.029 ± 0.031	–	0.802 ± 0.017	12.8
ETSformer [40]	0.741 ± 0.018	0.998 ± 0.029	–	0.768 ± 0.016	13.1
PatchTST [7]	0.704 ± 0.016	0.954 ± 0.027	0.00%	0.736 ± 0.015	14.5
CLIP [17]	0.678 ± 0.015	0.901 ± 0.025	+3.69%	0.695 ± 0.014	23.4
GraphCast [31]	0.656 ± 0.014	0.888 ± 0.024	+6.82%	0.667 ± 0.013	16.4
EVAD [19]	0.643 ± 0.013	0.872 ± 0.023	+8.66%	0.651 ± 0.013	18.6
BEVT-MOE [18]	0.631 ± 0.013	0.856 ± 0.022	+10.37%	0.634 ± 0.012	22.9
Ours	0.598 ± 0.012	0.803 ± 0.020	+15.06%	0.592 ± 0.011	14.7

Table 3. Ablation results on the TravelEventOps dataset (5-fold cross-validation mean ± std).

Model Variant	MAE ↓	RMSE ↓	BRI ↑ (%)	ERE ↓ (%)
Full Model (Ours)	0.598 ± 0.012	0.803 ± 0.020	+15.06%	0.592 ± 0.011
w/o MGRM	0.622 ± 0.013	0.835 ± 0.022	+11.65%	0.613 ± 0.012
w/o TER	0.641 ± 0.014	0.867 ± 0.024	+9.01%	0.645 ± 0.013
w/o EME	0.666 ± 0.015	0.892 ± 0.025	+5.39%	0.688 ± 0.014
PatchTST	0.704 ± 0.016	0.954 ± 0.027	0.00%	0.736 ± 0.015

Table 4. Parameter sensitivity analysis (5-fold cross-validation mean ± std).

Parameter	Test Values	Best	MAE ↓	RMSE ↓	ERE ↓ (%)
Contrastive Loss Weight	0.0/0.5/1.0	0.5	0.598 ± 0.012	0.803 ± 0.020	0.592 ± 0.011
Graph Reg. Weight ( $λ_{2}$ )	0.0/0.05/0.1	0.05	0.598 ± 0.012	0.803 ± 0.020	0.592 ± 0.011
Patch Length ( $l_{p}$ )	4/8/16	8	0.598 ± 0.012	0.803 ± 0.020	0.592 ± 0.011

Table 5. Robustness test under missing or noisy modalities.

Degradation Type	MAE ↓	RMSE ↓	ERE ↓ (%)
Full Modalities (Ours)	0.598	0.803	0.592
Missing Event Text Input	0.619	0.828	0.624
Missing Event Image Input	0.631	0.846	0.639
Missing Both Text and Image	0.659	0.887	0.673
Gaussian Noise on Time Series	0.645	0.871	0.651

Table 6. Average modality contribution weights.

Scenario	Time Series (%)	Event Text (%)	Event Image (%)
Regional Forecast	43.2	31.6	25.2
Campaign Surge Modeling	28.9	41.3	29.8
Long-Term Trend Tracking	54.5	24.6	20.9
Post-Event Recovery	33.1	36.5	30.4

Table 7. Cross-region forecasting performance.

Train Region	Test Region	MAE ↓	RMSE ↓	ERE ↓	Fine-Tuned MAE ↓
Beijing	Guangzhou	0.634	0.848	0.631	0.599
Shanghai	Guilin	0.648	0.866	0.647	0.601
Beijing	Wuhan	0.659	0.879	0.655	0.606

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cao, S.; Zhou, C. Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning. Symmetry 2025, 17, 1788. https://doi.org/10.3390/sym17111788

AMA Style

Cao S, Zhou C. Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning. Symmetry. 2025; 17(11):1788. https://doi.org/10.3390/sym17111788

Chicago/Turabian Style

Cao, Shu, and Can Zhou. 2025. "Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning" Symmetry 17, no. 11: 1788. https://doi.org/10.3390/sym17111788

APA Style

Cao, S., & Zhou, C. (2025). Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning. Symmetry, 17(11), 1788. https://doi.org/10.3390/sym17111788

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Event-Aware Multimodal Time-Series Forecasting via Symmetry-Preserving Graph-Based Cross-Regional Transfer Learning

Abstract

1. Introduction

2. Related Work

2.1. Time-Series Forecasting in Socio-Economic and Behavioral Modeling

2.2. Multimodal Learning for Forecasting and Decision Support

2.3. Event-Aware and Volatility-Sensitive Forecasting

2.4. Graph-Based Transfer Learning and Regional Adaptation

3. Method

3.1. Event-Aware Multimodal Representation Learning

3.2. Temporal Event Reasoning Module

3.3. Multiscale Graph Reasoning Module

3.4. Method Summary

4. Experimental Results

4.1. Implementation and Evaluation Protocol

4.2. Datasets

4.3. Experimental Setup

4.4. Evaluation Metrics

4.5. Result and Discussion

5. Ablation Study and Robustness Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI