Symmetry-Aware Transformers for Asymmetric Causal Discovery in Financial Time Series

Zheng, Wenxia; Liu, Wenhe

doi:10.3390/sym17101591


by Wenxia Zheng ¹,* and Wenhe Liu ²

¹ Department of Economics, Texas A&M University, College Station, TX 77840, USA

² School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

* Author to whom correspondence should be addressed.

Symmetry 2025, 17(10), 1591; https://doi.org/10.3390/sym17101591

Submission received: 28 July 2025 / Revised: 24 August 2025 / Accepted: 6 September 2025 / Published: 24 September 2025

(This article belongs to the Section Mathematics)


Abstract

Financial markets exhibit fundamental asymmetries in temporal causality, where policy interventions create asymmetric transmission patterns that traditional symmetric modeling approaches fail to capture. This work introduces a mathematical framework that exploits the inherent symmetries of transformer architectures while preserving essential asymmetric temporal relationships in financial causal inference. We develop CausalFormer, a symmetry-aware neural architecture that maintains the permutation equivariance properties of self-attention mechanisms while enforcing strict temporal asymmetry constraints for causal discovery. The framework incorporates three mathematically principled components: (1) a symmetric attention matrix construction with asymmetric temporal masking that preserves the mathematical elegance of transformer operations while ensuring causal consistency, (2) a multi-scale convolution module with symmetric kernel initialization but asymmetric temporal receptive fields that captures policy transmission effects across heterogeneous time horizons, and (3) an enhanced Nelson–Siegel decomposition that maintains the symmetric factor structure while modeling the evolution dynamics of asymmetric factors. Our mathematical formulation establishes the formal symmetry properties of the attention mechanism under temporal transformations while proving asymmetric convergence behaviors in policy transmission scenarios. The integration of symmetric optimization landscapes with asymmetric causal constraints enables simultaneous achievement of mathematical elegance and economic interpretability. Comprehensive experiments on monetary policy datasets demonstrate that the symmetry-aware design improves causal effect estimation accuracy by 15.3% and predictive performance by 12.7% relative to existing methods while maintaining a causal consistency score of 91.2%.
The framework successfully identifies asymmetric policy transmission mechanisms, revealing that monetary tightening exhibits 40% faster propagation than easing policies, establishing new mathematical insights into the temporal asymmetries in financial systems. This work demonstrates how principled exploitation of architectural symmetries combined with domain-specific asymmetric constraints opens up new directions for mathematically rigorous yet economically interpretable deep learning in financial econometrics, with broad applications spanning computational finance, economic forecasting, and policy analysis.

Keywords: financial time series; causal discovery; symmetry; transformer

1. Introduction

Financial markets operate as complex adaptive systems where the accurate identification of causal relationships between economic variables, policy interventions, and market outcomes has become increasingly critical for central bank policy formulation, risk management, and economic forecasting [1,2]. The COVID-19 pandemic and subsequent monetary policy responses highlighted the urgent need for models that can both predict market dynamics and provide transparent causal insights into transmission mechanisms [3]. However, existing approaches face a fundamental tension: the traditional econometric methods offer interpretable causal inference but struggle with high-dimensional nonlinear dynamics, while modern deep learning approaches achieve a superior predictive performance but operate as “black boxes” that provide limited causal understanding.

This interpretability–accuracy trade-off poses significant challenges for financial practitioners and policy makers who require both precise forecasts and transparent causal mechanisms for evidence-based decision making. Vector Autoregression (VAR) models and their structural variants, despite their widespread adoption in monetary policy analyses [4,5], often impose overly restrictive linearity assumptions that may not capture the complex regime changes and threshold effects characterizing modern financial markets. Conversely, transformer architectures and other deep learning approaches [6,7] demonstrate remarkable predictive capabilities but fail to provide the causal interpretability essential for regulatory compliance and policy formulation.

Recent advances in causal inference methodology have provided rigorous frameworks for identifying causal relationships from observational data [8,9]. However, the integration of these methodologies with modern deep learning architectures for sequential financial data remains largely unexplored. Existing attempts to combine causal inference with machine learning typically treat prediction and causal discovery as separate sequential steps, failing to leverage the potential synergies between these complementary objectives.

This paper introduces CausalFormer, a novel transformer architecture that fundamentally bridges the gap between the predictive performance and causal interpretability in financial time series analysis. Our framework represents a paradigm shift by embedding causal inference mechanisms directly within transformer blocks, enabling simultaneous optimization for both predictive accuracy and causal validity. Unlike existing approaches that compromise one objective for the other, CausalFormer demonstrates that rigorous causal constraints can enhance rather than hinder the predictive performance when properly integrated into the architectural design.

The key innovation lies in three mathematically principled components that work synergistically to achieve this dual optimization. First, we develop a causal self-attention mechanism that enforces temporal priority constraints while incorporating the learned causal structure, ensuring that the attention patterns respect economic theory without sacrificing the expressive power of the transformers. Second, we introduce a multi-kernel causal convolution module that captures policy transmission effects across heterogeneous time horizons, recognizing that the impacts of monetary policy propagate through different temporal channels with varying speeds and magnitudes. Third, we present an enhanced Nelson–Siegel decomposition layer that maintains the interpretable factor structure essential for a yield curve analysis while allowing the neural network flexibility to capture nonlinear factor dynamics.

The integration of DoWhy causal inference frameworks [10] within our architecture enables a comprehensive uncertainty quantification for causal effects, addressing a critical limitation in financial modeling where policy decisions must account for estimation uncertainty. Furthermore, our framework incorporates advanced scenario generation techniques using policy surprise identification strategies, providing regulatory-compliant stress testing capabilities essential for Basel III compliance and internal risk model validation.

Our contributions establish three fundamental advances in computational finance. First, we demonstrate that architectural symmetries in transformers can be systematically exploited while preserving the asymmetric temporal relationships essential for causal inference, opening new directions for mathematically principled neural architecture design. Second, we provide empirical evidence that causal consistency and predictive performance are mutually reinforcing rather than competing objectives: CausalFormer improves causal effect estimation accuracy by 15.3% and predictive performance by 12.7% relative to existing methods while maintaining a causal consistency score of 91.2%. Third, we reveal novel insights into monetary policy transmission mechanisms, including previously undocumented asymmetric channels where policy tightening exhibits 40% faster propagation than easing policies, providing actionable intelligence for central bank policy design.

The remainder of this paper systematically develops our theoretical framework and empirical validation. Section 2 positions our work within the existing literature spanning causal inference, transformer architectures, and financial econometrics. Section 3 presents the mathematical foundation and detailed architecture of CausalFormer with emphasis on the implementation details. Section 4 describes our comprehensive experimental design and datasets. Section 5 presents empirical results demonstrating superior performance across multiple evaluation dimensions. Section 6 discusses policy implications and practical applications, while Section 7 concludes with directions for future research in causal deep learning for finance.

2. Related Work

The development of CausalFormer draws upon several interconnected research streams spanning econometric modeling, deep learning architectures, causal inference methodologies, and their applications to financial markets. This section provides a comprehensive review of these foundational areas and positions our contribution within the broader literature.

2.1. Traditional Econometric Approaches in Financial Time Series

Classical econometric methods have long served as the cornerstone of financial time series analysis, with Vector Autoregression (VAR) models representing the dominant paradigm for modeling multivariate financial relationships [4]. The seminal work [4] established VAR methodology as the standard approach to analyzing dynamic relationships in macroeconomic and financial data, providing a framework that treats all variables as endogenous and allows for rich lag structures.

Extensions to the basic VAR framework have addressed various limitations through structural identification. Blanchard et al. [5] introduced the concept of SVAR models, which impose economic-theory-based restrictions to identify structural shocks. This approach has been particularly influential in monetary policy analysis, where researchers seek to identify the causal effects of policy interventions. The long-run restriction methodology developed by [5] has become a standard tool for policy analysis.

Cointegration analysis [11] has provided another crucial dimension to financial econometric modeling. Vector Error Correction Models (VECMs) have enabled researchers to model both short-run dynamics and long-run equilibrium relationships, proving particularly valuable for yield curve modeling and term structure analysis.

Despite these advances, the traditional econometric approaches face significant limitations when applied to high-frequency financial data. The curse of dimensionality becomes particularly acute in VAR models because the number of parameters grows quadratically with the number of variables and linearly with the number of lags. Additionally, the assumption of linearity underlying most econometric models may be overly restrictive for financial data characterized by regime changes, threshold effects, and complex nonlinear dynamics.

2.2. Deep Learning in Financial Applications

The application of deep learning methodologies to financial problems has experienced remarkable growth over the past decade, driven by the availability of large-scale datasets and computational advances. Early applications focused primarily on price predictions and algorithmic trading, with White et al. [12] providing one of the first comprehensive applications of neural networks to financial forecasting.

Recurrent Neural Networks (RNNs) and their variants have shown particular promise for sequential financial data. Rather et al. [13] demonstrated the effectiveness of RNNs for stock price prediction, while Nelson et al. [14] showed that LSTM networks could capture long-term dependencies in financial time series more effectively than traditional approaches. The gating mechanisms in LSTM architectures have proven particularly valuable for financial applications where both short-term fluctuations and long-term trends are important.

Convolutional Neural Networks (CNNs) have found applications in financial signal processing and pattern recognition. Sezer et al. [15] provided a comprehensive review of CNN applications in financial markets, highlighting their effectiveness in extracting features from high-dimensional financial data. The work [16] on temporal convolutional networks has shown promise for capturing multi-scale temporal patterns in financial time series.

More recently, attention mechanisms and transformer architectures have gained prominence in financial applications. Feng et al. [17] demonstrated that attention mechanisms could improve the performance of RNNs for financial forecasting by allowing models to focus on relevant historical information. The transformer architecture, introduced by [6], has shown remarkable success in financial applications, with Wu et al. [18] demonstrating its superior performance in stock price prediction tasks.

Generative Adversarial Networks (GANs) have opened new possibilities for financial modeling and risk assessment. Wiese et al. [19] showed that GANs could generate realistic financial time series for stress testing and scenario analyses. The work of [20] on conditional GANs has demonstrated the potential for generating financial scenarios conditioned on specific market conditions or policy interventions.

2.3. Causal Inference in Econometrics and Machine Learning

The field of causal inference has experienced a renaissance in recent decades, with contributions from both econometrics and computer science communities. The potential outcome framework [9] has provided a rigorous mathematical foundation for causal inference. This framework, often referred to as the Rubin Causal Model (RCM), has become central to modern causal analyses in economics and finance [21].

Pearl’s causal hierarchy, articulated in [8], has fundamentally changed how researchers approach causal questions. The distinction between association, intervention, and counterfactual reasoning has provided a systematic framework for understanding different types of causal queries. Pearl’s do-calculus has enabled researchers to derive causal effects from observational data under specific assumptions about the causal structure [22].

Directed Acyclic Graphs (DAGs) have emerged as a powerful tool for representing causal relationships and identifying potential confounders. The work on DAGs in epidemiology was extended to economic applications by [23]. The identification of instrumental variables through a DAG analysis has proven particularly valuable for financial applications where randomized experiments are infeasible [24].

Causal discovery algorithms represent another important strand of the literature on causal inference. The PC algorithm [25] has provided automated approaches to learning the causal structure from data. More recent developments, such as the Fast Causal Inference (FCI) algorithm [26], have extended these methods to handling latent confounders and selection bias.

The integration of causal inference with machine learning has gained momentum through the development of causal machine learning frameworks. Athey et al. [27] introduced causal trees for heterogeneous treatment effects, while Wager et al. [28] developed causal forests for high-dimensional causal inference. The work of [29] on double machine learning has provided a framework for using machine learning methods in causal inference while maintaining valid statistical inference.

2.4. Causal Inference in Financial Markets

The application of causal inference methods to financial markets has been an active area of research, particularly in the context of policy evaluations and market microstructure analyses. Angrist et al. [30] emphasized the importance of credible identification strategies in financial economics, advocating for the use of natural experiments and instrumental variables.

The identification of monetary policy shocks has been a particular focus of causal analyses in financial markets. Romer et al. [31] developed narrative approaches to identifying monetary policy shocks, while Gertler et al. [32] used high-frequency identification strategies. The work [33] on the high-frequency identification of the transmission of monetary policy provided new insights into the causal effects of policy interventions on financial markets.

Regression discontinuity designs have found applications in financial regulation studies. Christoffersen et al. [34] used RDD to study the effects of governance regulations on firm value, while Bradley et al. [35] applied these methods to analyzing the impact of credit rating changes. The work [36] on mortgage market regulations has demonstrated the potential for RDD in financial policy evaluation.

2.5. Transformer Architectures and Financial Applications

The transformer architecture [6] has revolutionized sequence modeling across various domains. The self-attention mechanism allows models to capture long-range dependencies without the computational constraints of recurrent architectures. BERT [37] demonstrated the effectiveness of bidirectional transformers for language understanding, while Brown et al. [38] showed that large-scale transformer models could achieve remarkable performance across diverse tasks.

Financial applications of transformer architectures have shown promising results. Yoo [39] applied transformers to stock price prediction, demonstrating a superior performance compared to that of traditional RNN-based approaches. The work [40] on portfolio optimization showed that transformer-based models could effectively capture the complex dependencies in multi-asset portfolios.

Attention mechanisms have proven particularly valuable for financial time series analysis. Qin et al. [41] introduced dual-stage attention mechanisms for financial forecasting, while Shih et al. [42] developed temporal attention networks for volatility prediction. The work [43] on hierarchical attention networks has shown promise for modeling multi-scale financial patterns.

Recent developments in transformer architectures have focused on improving efficiency and interpretability. Kitaev et al. [44] introduced the Reformer architecture to address the computational limitations of standard transformers, while Wang et al. [45] developed linear attention mechanisms. The work [46] on sparse transformers has enabled their application to longer sequences, which is particularly relevant for financial time series analyses.

2.6. Yield Curve Modeling and Term Structure Analysis

The modeling of yield curves and term structure dynamics represents a specialized area where both traditional econometric methods and modern machine learning approaches have been applied. The Nelson–Siegel model, proposed by [47], has become a standard framework for parameterizing yield curves. The extension by Svensson et al. [48] has enhanced the model’s flexibility and forecasting performance.

Dynamic factor models have provided another important approach to yield curve modeling. Diebold et al. [49] showed that a small number of factors could explain most of the variation in yield curves, leading to the development of dynamic Nelson–Siegel models.

Machine learning approaches to yield curve modeling have gained traction in recent years. Bauer et al. [50] applied a principal component analysis and regularization techniques to international yield curve data, while Exterkate et al. [51] used neural networks for term structure forecasting. The work [52] on deep learning for yield curve modeling has demonstrated the potential for neural networks to capture complex nonlinear relationships in term structure data.

2.7. The Integration of Causal Inference and Deep Learning

The integration of causal inference principles with deep learning architectures represents an emerging and rapidly evolving research area. Shanmugam et al. [53] provided a comprehensive framework for causal representation learning, while Schölkopf et al. [54] argued for the importance of causal thinking in machine learning more broadly.

Causal attention mechanisms have been explored in various contexts. Wang [55] introduced causal attention for natural language processing, while Sui et al. [56] developed causal transformer architectures for sequential decision-making. The work [57] on temporal fusion transformers has incorporated causal mechanisms for time series forecasting.

Structural causal models have been integrated with neural networks to create interpretable deep learning architectures. Goudet et al. [58] developed causal generative neural networks, while Khemakhem et al. [59] introduced identifiable variational autoencoders for causal representation learning. The work [60] on deep structural causal models has provided a framework for learning causal representations from high-dimensional data.

The DoWhy framework, developed by [10], has provided a unified interface for causal inference that can be integrated with machine learning pipelines. This framework has enabled researchers to combine the predictive power of machine learning with the interpretability of causal inference methods.

2.8. Gaps in the Existing Literature

Despite significant advances in each research area, several critical gaps remain that motivate our work. Most importantly, existing hybrid approaches that combine econometric and deep learning methods typically employ sequential architectures, where VAR/SVAR models first identify structural shocks and then feed these estimates into neural networks for prediction. These sequential methods treat causal identification as a preprocessing step, fundamentally losing the bidirectional feedback between causal structure learning and predictive modeling that characterizes real financial systems. The structural shocks identified in the first stage carry forward estimation uncertainty that compounds in subsequent neural network training, while the predictive models cannot inform causal structure refinements.

Traditional hybrid VAR–neural approaches face particular limitations in capturing asymmetric transmission mechanisms. These methods inherit the symmetric linear foundations of VAR models, requiring ad hoc modifications such as threshold VAR or regime-switching extensions to accommodate policy asymmetries. Such approaches necessitate separate model estimations for different regimes, losing information about the transition dynamics and requiring a priori specification of the regime-switching mechanisms. The resulting models cannot discover asymmetric effects endogenously but must impose them through predetermined structural assumptions.

The existing deep learning approaches to financial time series, while achieving an impressive predictive performance, typically lack the causal interpretability required for policy analysis and risk management [7]. Standard transformer architectures and LSTM networks operate as “black boxes” that provide limited insight into the underlying transmission mechanisms driving observed relationships. This opacity prevents their adoption in regulatory contexts where model explainability is mandatory for stress testing and capital adequacy assessment.

The integration of causal inference with transformer architectures remains largely unexplored, particularly for financial applications where temporal causality is crucial. Existing causal machine learning frameworks such as double machine learning [29] and causal forests [28] focus primarily on cross-sectional treatment effect estimation rather than temporal causal discovery in sequential data. The few attempts to combine transformers with causal inference [57] have treated causality as a constraint rather than leveraging it for improved learning.

CausalFormer addresses these gaps through joint optimization of causal structure learning and predictive modeling within a unified transformer architecture. Unlike sequential hybrid approaches, our framework enables simultaneous discovery of both causal relationships and asymmetric propagation dynamics through multi-kernel causal convolutions and policy-specific attention mechanisms. The architecture naturally accommodates regime-dependent transmission patterns without requiring separate model specification, enabling endogenous discovery of phenomena such as the asymmetric policy transmission we demonstrate empirically. Our approach fundamentally differs from existing methods by treating causal constraints as architectural features that enhance rather than limit the model’s expressiveness, achieving a superior performance in both causal inference and prediction tasks within a single unified framework.

3. Methodology

This section presents the theoretical foundation and detailed architecture of CausalFormer, our novel transformer-based framework for causal inference in financial time series analysis. We begin by establishing the mathematical foundation for integrating causal mechanisms within transformer architectures and then systematically describe each component of our framework, with emphasis on practical implementation and rigorous logical development.

3.1. The Theoretical Framework

The foundation of CausalFormer rests on the principled integration of structural causal models with transformer architectures for financial time series analysis. We consider a d-dimensional financial time series $X_t = [X_{1,t}, X_{2,t}, \dots, X_{d,t}]^{T}$ at time $t$, where each component corresponds to distinct financial variables such as interest rates, bond yields, or monetary policy indicators. The underlying assumption is that these variables follow a structural causal model governing their temporal evolution, expressed as

$$X_{i,t} = f_i\left(\mathrm{pa}(X_{i,t}), \epsilon_{i,t}\right) \tag{1}$$

where $\mathrm{pa}(X_{i,t})$ denotes the parents of $X_{i,t}$ in the causal graph, including both contemporaneous and lagged variables, and $\epsilon_{i,t}$ represents exogenous noise terms.
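To make Eq. (1) concrete, the following is a minimal simulation sketch rather than the authors' data-generating process. It assumes each $f_i$ is a `tanh` of a weighted sum of lag-1 parents plus additive Gaussian noise; the function name `simulate_scm`, the weight matrix `B`, and the three-variable chain are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_scm(T, lag_weights, noise_scale=0.1, seed=0):
    """Simulate X_{i,t} = tanh(sum_j B[i, j] * X_{j,t-1}) + eps_{i,t}.

    lag_weights[i, j] != 0 means X_{j,t-1} is a parent of X_{i,t}.
    Assumption: lag-1 parents only and additive Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    d = lag_weights.shape[0]
    X = np.zeros((T, d))
    X[0] = rng.normal(scale=noise_scale, size=d)
    for t in range(1, T):
        eps = rng.normal(scale=noise_scale, size=d)
        X[t] = np.tanh(lag_weights @ X[t - 1]) + eps
    return X

# Hypothetical chain: policy rate -> short yield -> long yield.
B = np.array([
    [0.9, 0.0, 0.0],
    [0.5, 0.4, 0.0],
    [0.0, 0.6, 0.3],
])
X = simulate_scm(T=200, lag_weights=B)
```

The zero pattern of `B` plays the role of the parent sets $\mathrm{pa}(X_{i,t})$: row `i` lists which lagged variables feed variable `i`.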

The challenge of learning both functional relationships $f_i$ and the causal structure from observational financial data requires careful consideration of the temporal consistency. We formalize temporal causality through a temporal priority ordering $\prec$ such that

$$X_{i,s} \prec X_{j,t} \;\Leftrightarrow\; s < t \ \text{or}\ (s = t \ \text{and}\ i \ \text{causally precedes}\ j) \tag{2}$$

This ordering ensures that causal relationships respect the fundamental principle that causes must precede effects in time, which is particularly crucial in financial markets where information propagation and policy transmission mechanisms operate through well-defined temporal channels.

We represent the causal structure through a DAG $G = (V, E)$, where $V$ represents financial variables and $E$ represents causal relationships. For financial time series, we extend this to a temporal DAG where the edges respect temporal precedence:

$$G_T = \{(X_{i,s}, X_{j,t}) \in E : X_{i,s} \prec X_{j,t}\} \tag{3}$$
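The ordering in Eq. (2) and the edge filter in Eq. (3) can be sketched in a few lines. This is an illustrative sketch under one assumption: "i causally precedes j" at equal time stamps is encoded by a hypothetical `causal_rank` mapping (lower rank means earlier in the contemporaneous order). The names `Node`, `precedes`, and `temporal_edges` are not from the source.

```python
from typing import NamedTuple

class Node(NamedTuple):
    var: int   # variable index i
    time: int  # time index t

def precedes(a, b, causal_rank):
    """Return True if node a precedes node b in the sense of Eq. (2)."""
    if a.time != b.time:
        return a.time < b.time
    # Same time stamp: fall back on the assumed contemporaneous order.
    return causal_rank[a.var] < causal_rank[b.var]

def temporal_edges(edges, causal_rank):
    """Keep only edges (a, b) with a preceding b, as in Eq. (3)."""
    return [(a, b) for a, b in edges if precedes(a, b, causal_rank)]

causal_rank = {0: 0, 1: 1}  # hypothetical: variable 0 acts before variable 1
edges = [
    (Node(0, 1), Node(1, 2)),  # lagged, forward in time: kept
    (Node(1, 2), Node(0, 1)),  # backward in time: dropped
    (Node(0, 3), Node(1, 3)),  # contemporaneous, rank-respecting: kept
    (Node(1, 3), Node(0, 3)),  # contemporaneous, rank-violating: dropped
]
kept = temporal_edges(edges, causal_rank)
```

Because any kept edge either moves strictly forward in time or strictly up the contemporaneous rank, the filtered edge set cannot contain a cycle.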

The identification of causal effects follows the do-calculus, where interventions into the monetary policy variables $Z_t$ enable the estimation of their causal impact on target variables $Y_t$ through

$$P(Y_t \mid do(Z_t = z)) = \sum_{w} P(Y_t \mid Z_t = z, W_t = w)\, P(W_t = w) \tag{4}$$

where $W_t$ represents conditioning variables satisfying the backdoor criterion.
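As a worked illustration of the adjustment formula in Eq. (4), the sketch below estimates $P(Y = 1 \mid do(Z = z))$ from samples by stratifying on a discrete adjustment variable. It assumes binary $Y$, discrete $Z$ and $W$, and that `w` satisfies the backdoor criterion; the function name `backdoor_adjust` and the eight-sample toy data are invented for the example.

```python
import numpy as np

def backdoor_adjust(y, z, w, z_value):
    """Estimate P(Y=1 | do(Z=z_value)) = sum_w P(Y=1 | Z=z, W=w) P(W=w)."""
    y, z, w = map(np.asarray, (y, z, w))
    total = 0.0
    for w_value in np.unique(w):
        stratum = w == w_value
        treated = stratum & (z == z_value)
        if not treated.any():
            # Positivity fails: this stratum never receives z_value.
            raise ValueError(f"no samples with Z={z_value}, W={w_value}")
        p_y_given_zw = y[treated].mean()  # P(Y=1 | Z=z, W=w)
        p_w = stratum.mean()              # P(W=w)
        total += p_y_given_zw * p_w
    return total

# Toy data (hypothetical): two strata of W with four samples each.
w = [0, 0, 0, 0, 1, 1, 1, 1]
z = [1, 1, 0, 0, 1, 1, 0, 0]
y = [1, 0, 0, 0, 1, 1, 1, 0]
effect_on = backdoor_adjust(y, z, w, z_value=1)   # 0.5*0.5 + 1.0*0.5
effect_off = backdoor_adjust(y, z, w, z_value=0)  # 0.0*0.5 + 0.5*0.5
```

The explicit error on an empty stratum reflects the positivity requirement behind Eq. (4): every value of $W_t$ must be observed together with the intervention value.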

3.2. An Overview of the CausalFormer Architecture

The CausalFormer architecture represents a fundamental advancement in integrating causal inference mechanisms directly into transformer frameworks through three interconnected innovations operating within a parallel computational structure. The first innovation involves a causal self-attention mechanism that enforces temporal priority constraints while incorporating the learned causal structure. Following attention processing, the second and third components operate in parallel: a multi-kernel causal convolution module captures policy transmission effects across multiple time horizons, while an enhanced Nelson–Siegel decomposition layer maintains causal consistency in yield curve modeling through interpretable factor structures.

As illustrated in Figure 1, each transformer layer processes attended representations through parallel computational streams. The multi-kernel causal convolution module and the Nelson–Siegel decomposition layer process the same attended input simultaneously, generating specialized feature representations that capture distinct aspects of financial dynamics. The causal convolution stream focuses on multi-scale policy transmission effects through dilated temporal kernels, while the Nelson–Siegel stream models yield curve factor evolution through interpretable mathematical structures. These parallel outputs are then combined through a learned fusion mechanism with the attention weights $\gamma_{\mathrm{conv}}$ and $\gamma_{\mathrm{NS}}$:

$$h_{\mathrm{fused}} = \gamma_{\mathrm{conv}} \cdot h_{\mathrm{conv}} + \gamma_{\mathrm{NS}} \cdot h_{\mathrm{NS}} \tag{5}$$

This parallel design enables simultaneous optimization of temporal pattern recognition and structural factor modeling while maintaining computational efficiency through specialized processing paths. The complete architecture processes financial time series $X \in \mathbb{R}^{T \times d}$ through $L$ transformer layers, culminating in outputs that provide both predictive estimates and causal effect quantifications with associated uncertainty measures.
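The fusion step in Eq. (5) can be sketched as a weighted sum of the two stream outputs. The text does not specify how the two weights are parameterized, so this sketch assumes they come from a softmax over two learnable logits (which makes them positive and sum to one); `fuse` and `gate_logits` are hypothetical names.

```python
import numpy as np

def fuse(h_conv, h_ns, gate_logits):
    """Combine the two stream outputs as in Eq. (5).

    Assumption: gamma_conv and gamma_NS are a softmax over two logits.
    """
    exp = np.exp(gate_logits - np.max(gate_logits))
    gamma_conv, gamma_ns = exp / exp.sum()
    return gamma_conv * h_conv + gamma_ns * h_ns

T, d = 5, 4
h_conv = np.ones((T, d))       # stand-in for the convolution stream output
h_ns = 3.0 * np.ones((T, d))   # stand-in for the Nelson-Siegel stream output
h_fused = fuse(h_conv, h_ns, gate_logits=np.zeros(2))  # equal weights
```

With equal logits the result is the plain average of the two streams; during training the logits would move the mix toward whichever stream is more useful.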

3.3. The Causal Self-Attention Mechanism

The standard transformer self-attention mechanism computes attention weights based purely on content similarity without considering causal constraints, which is inadequate for financial applications where temporal causality is fundamental. Our causal self-attention mechanism addresses this limitation by incorporating both temporal priority constraints and the learned causal structure into the attention computation process, as shown in Figure 2.

The temporal priority constraints are enforced through a modified attention mask $M^{\mathrm{causal}} \in \mathbb{R}^{T \times T}$ that extends beyond the standard causal masking used in language models. This mask is defined as

$$M^{\mathrm{causal}}_{i,j} = \begin{cases} 0 & \text{if } X_i \prec X_j \\ -\infty & \text{otherwise} \end{cases} \tag{6}$$

where the indices $i, j \in \{1, 2, \dots, T\}$ represent temporal positions, ensuring that the attention weights are zero for temporally inadmissible connections.
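A small sketch shows how an additive mask of the kind in Eq. (6) drives inadmissible attention weights to exactly zero after the softmax. Two indexing choices here are assumptions rather than statements from the text: rows index the query position and columns the key position, and a position may attend to itself so that every row keeps at least one admissible entry. The helper names are hypothetical.

```python
import numpy as np

def causal_mask(T):
    """Return a (T, T) mask: 0 where key s <= query t, -inf where s > t."""
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def softmax(scores):
    """Row-wise softmax that tolerates -inf entries."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) evaluates to exactly 0.0
    return exp / exp.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
# Uniform content scores, so the mask alone shapes the weights.
weights = softmax(np.zeros((4, 4)) + mask)
```

With uniform scores, row `t` spreads its weight evenly over positions `0..t` and assigns nothing to later positions.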

Beyond temporal constraints, the mechanism integrates the learned causal structure through an adaptive attention formulation. Let $A^{\mathrm{structure}}$ represent the adjacency matrix of the estimated causal DAG, which is learned jointly with the model parameters. The causal attention weights are computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}} + M^{\mathrm{causal}} + \lambda A^{\mathrm{structure}}\right) V \tag{7}$$

where the dimensional compatibility ensures valid matrix operations: $Q \in \mathbb{R}^{T \times d_k}$, $K \in \mathbb{R}^{T \times d_k}$, $V \in \mathbb{R}^{T \times d_v}$, $M^{\mathrm{causal}} \in \mathbb{R}^{T \times T}$, and $A^{\mathrm{structure}} \in \mathbb{R}^{T \times T}$. Here, $T$ represents the sequence length (time steps); $d_k$ denotes the key/query dimension; and $d_v$ represents the value dimension. The attention score matrix $Q K^{T} \in \mathbb{R}^{T \times T}$ maintains identical dimensions to both the causal mask and the structure matrix, enabling direct element-wise addition operations within the softmax argument.

The parameter $\lambda$ is a learnable scalar controlling the influence of the causal structure on attention weights. This formulation enables the model to leverage both data-driven attention patterns and theoretically motivated causal relationships.
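Eq. (7) can be sketched for a single head in NumPy. This is an illustrative sketch, not the authors' implementation: it assumes rows are queries and columns are keys, allows each position to attend to itself, treats `lam` as a fixed number rather than a trained parameter, and uses a random lower-triangular matrix as a stand-in for the learned $A^{\mathrm{structure}}$.

```python
import numpy as np

def causal_attention(Q, K, V, A_structure, lam):
    """softmax(Q K^T / sqrt(d_k) + M_causal + lam * A_structure) V."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                     # (T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # key after query
    mask = np.where(future, -np.inf, 0.0)
    logits = scores + mask + lam * A_structure
    logits -= logits.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
A = np.tril(rng.uniform(size=(T, T)))  # hypothetical structure scores
out, weights = causal_attention(Q, K, V, A, lam=0.5)
```

Because the structure term is added inside the softmax, a larger `lam` shifts weight toward pairs that the structure matrix scores highly, while the mask still overrides everything for inadmissible pairs.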

The multi-head extension of causal attention allows specialization across different aspects of causal relationships. The outputs of the individual heads are concatenated and projected:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_{1}, \dots, \text{head}_{h}) W^{O}

(8)

where individual heads focus on different temporal horizons and causal relationship types. This design enables simultaneous capture of immediate causal effects, such as intraday market responses to policy announcements, and longer-term transmission mechanisms that operate through complex institutional channels.

3.4. The Multi-Kernel Causal Convolution Module

Financial policy transmissions operate across fundamentally different time scales, from immediate market responses measured in minutes to long-term structural adjustments that unfold over quarters or years. Our multi-kernel causal convolution module captures these multi-scale dynamics while maintaining strict causal ordering throughout the temporal hierarchy.

The temporal convolution implementation employs dilated causal convolutions with multiple kernel sizes to capture policy transmission effects across different time scales. Each convolution operation is defined as

h_{t}^{(k)} = \sum_{i = 0}^{k - 1} w_{i}^{(k)} \cdot x_{t - i}

(9)

where $k$ represents the kernel size, and the summation strictly includes only past values to ensure causal consistency. Equation (9) is written for unit dilation; with a dilation rate $d$, the lagged input $x_{t - i}$ becomes $x_{t - d i}$. The dilation rates $d \in \{1, 2, 4, 8, 16\}$ create a hierarchical receptive field that captures both immediate market reactions and longer-term transmission mechanisms without violating temporal causality.
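The causal convolution of Equation (9) can be sketched in a few lines of Python. The sketch below assumes zero padding before the start of the series and applies the standard dilated form, in which the lag $i$ is multiplied by the dilation rate; the function names and the toy series are hypothetical.

```python
def dilated_causal_conv(x, w, dilation=1):
    """h_t = sum_i w[i] * x[t - dilation * i], using only current and past inputs.

    Indices before the start of the series contribute zero (assumed padding).
    """
    h = []
    for t in range(len(x)):
        acc = 0.0
        for i, w_i in enumerate(w):
            idx = t - dilation * i
            if idx >= 0:
                acc += w_i * x[idx]
        h.append(acc)
    return h

def receptive_field(kernel_size, dilation):
    """Number of time steps spanned by one dilated causal kernel."""
    return 1 + (kernel_size - 1) * dilation

x = [1.0, 2.0, 3.0, 4.0, 5.0]
h_d1 = dilated_causal_conv(x, [1.0, 1.0, 1.0], dilation=1)
h_d2 = dilated_causal_conv(x, [1.0, 1.0, 1.0], dilation=2)
```

With a kernel of size 3, moving from dilation 1 to dilation 16 widens the span of a single kernel from 3 to 33 time steps without adding parameters, which is the sense in which the receptive field grows hierarchically.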

The recognition that different monetary policy instruments exhibit distinct transmission patterns motivates the introduction of policy-specific kernel weights. We implement separate kernel weights $W_{p}^{policy}$ for different policy types $p \in \{\text{rates}, \text{QE}, \text{forward guidance}\}$, where the convolution output becomes

h_{t}^{policy} = \sum_{p} α_{p} \cdot \text{Conv1D}(x_{t}; W_{p}^{policy})

(10)

The attention weights $α_{p}$ are learned parameters that enable adaptive weighting of the transmission mechanisms based on the prevailing policy context, allowing the model to recognize that quantitative easing operates through different channels than conventional interest rate policy.

Cross-scale feature fusion integrates information across different temporal scales while preserving causal consistency through a hierarchical mechanism. The fusion process is implemented as

h_{fused} = \text{LayerNorm}\left(\sum_{k} β_{k} \cdot h^{(k)} + h_{residual}\right)

(11)

where $β_{k}$ are learned fusion weights that determine the relative importance of different temporal scales, and $h_{residual}$ provides skip connections to preserve the information flow and prevent degradation during deep network training.
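A small sketch of the fusion step in Equation (11) is given below for a single time step. It assumes a layer normalization without learnable gain and bias and treats the fusion weights as fixed numbers; the names `layer_norm` and `fuse_scales` and the example vectors are hypothetical.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def fuse_scales(h_by_kernel, beta, h_residual):
    """LayerNorm(sum_k beta[k] * h^(k) + h_residual) for one time step.

    h_by_kernel: dict mapping kernel size k to a feature vector h^(k).
    beta: dict mapping kernel size k to its fusion weight.
    """
    dim = len(h_residual)
    mixed = [
        sum(beta[k] * h_by_kernel[k][c] for k in h_by_kernel) + h_residual[c]
        for c in range(dim)
    ]
    return layer_norm(mixed)

fused = fuse_scales(
    {3: [1.0, 2.0, 3.0, 4.0], 5: [0.0, 1.0, 0.0, 1.0]},
    {3: 0.5, 5: 0.5},
    [1.0, 1.0, 1.0, 1.0],
)
```

The residual term enters before normalization, so the skip connection and the scale-specific features share the same normalized range.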

3.5. The Enhanced Nelson–Siegel Decomposition Layer

Yield curve modeling requires specialized treatment that captures the underlying factor structure while maintaining causal relationships between factors and their economic determinants. Our enhanced Nelson–Siegel framework extends the traditional model with causal constraints and neural network flexibility while preserving the interpretability that makes Nelson–Siegel models valuable for policy analysis.

The causal factor decomposition builds upon the traditional Nelson–Siegel representation, where yield curves are expressed through three interpretable factors: level ($L_{t}$), slope ($S_{t}$), and curvature ($C_{t}$). The model represents yields as

y_{t}(τ) = L_{t} + S_{t} \cdot \frac{1 - e^{-λτ}}{λτ} + C_{t} \cdot \left(\frac{1 - e^{-λτ}}{λτ} - e^{-λτ}\right)

(12)

but extends this framework by incorporating causal dependencies, in which the factors themselves depend causally on policy variables and macroeconomic conditions. This extension is formalized through

L_{t} = f_{L}(Z_{t-1}, L_{t-1}, ϵ_{t}^{L})

(13)

S_{t} = f_{S}(Z_{t-1}, S_{t-1}, L_{t}, ϵ_{t}^{S})

(14)

C_{t} = f_{C}(Z_{t-1}, C_{t-1}, L_{t}, S_{t}, ϵ_{t}^{C})

(15)

where the causal ordering reflects the hierarchical nature of the yield curve factor relationships.
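For concreteness, the static Nelson–Siegel representation in Equation (12) can be evaluated with the short Python sketch below. The factor values and the decay parameter used in the example are hypothetical, and the function names are introduced here only for illustration.

```python
import math

def ns_loadings(tau, lam):
    """Level, slope, and curvature loadings at maturity tau for decay lam."""
    x = lam * tau
    slope = -math.expm1(-x) / x  # (1 - e^{-x}) / x, numerically stable for small x
    curvature = slope - math.exp(-x)
    return 1.0, slope, curvature

def ns_yield(tau, level, slope, curvature, lam):
    """y(tau) = L + S * slope_loading + C * curvature_loading."""
    load_l, load_s, load_c = ns_loadings(tau, lam)
    return level * load_l + slope * load_s + curvature * load_c

# Hypothetical factors: level 4%, slope -2%, curvature 1%, decay 0.5 per year.
curve = [ns_yield(tau, 4.0, -2.0, 1.0, 0.5) for tau in (0.25, 1, 2, 5, 10, 30)]
```

The limits make the factor labels concrete: as the maturity approaches zero the yield tends to $L_{t} + S_{t}$, and at very long maturities it tends to $L_{t}$, while the curvature loading vanishes at both ends.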

The neural parameterization of factor evolution functions employs Multi-Layer Perceptrons (MLPs) while respecting the causal ordering constraints. Each function is implemented as

f_{i}(\cdot) = \text{MLP}_{i}(\text{CausalEmbed}(Z_{t-1}) \oplus \text{FactorEmbed}(F_{t-1}))

(16)

where ⊕ denotes concatenation, and the embedding functions ensure proper dimensional alignment and causal temporal dependencies. The MLP architectures are designed with a sufficient capacity to capture nonlinear relationships while avoiding overfitting through appropriate regularization.

Maintaining interpretability while allowing for neural network flexibility requires careful regularization design. The regularization framework includes multiple components:

L_{NS} = \| y_{t}(τ) - \hat{y}_{t}(τ) \|^{2} + α_{smooth} \sum_{i} \| \nabla^{2} f_{i} \|^{2} + α_{sparse} \sum_{i} \| f_{i} \|_{1}

(17)

where the smoothness penalty encourages stable factor dynamics that align with economic intuition, and the sparsity penalty promotes interpretable factor loadings. These regularization terms ensure that the enhanced model retains the economic interpretability that makes Nelson–Siegel models valuable for policy analysis.

3.6. Integration with the DoWhy Framework

Rigorous causal inference requires systematic approaches to the identification, estimation, and validation of causal effects. Our integration with the DoWhy framework [10] provides these capabilities within the transformer architecture, enabling both automated causal discovery and robust treatment effect estimation with comprehensive uncertainty quantification.

The causal graph learning component incorporates automated causal discovery through constraint-based algorithms that operate jointly with the neural network training process. The optimization objective becomes

G^{*} = \underset{G}{\arg\min} \; L_{prediction}(G) + γ \cdot \text{BIC}(G)

(18)

where the Bayesian Information Criterion (BIC) penalizes graph complexity, and $γ$ controls the trade-off between the predictive accuracy and causal parsimony. This joint optimization ensures that the learned causal structure supports both accurate prediction and valid causal inference.

Treatment effect estimation leverages multiple identification strategies within the DoWhy framework to provide robust causal effect estimates. For each potential policy intervention, the framework computes

\hat{τ}_{backdoor} = E[E[Y \mid T = 1, W] - E[Y \mid T = 0, W]]

(19)

\hat{τ}_{IV} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(T, Z)}

(20)

\hat{τ}_{frontdoor} = \sum_{m} E[Y \mid T = 0, M = m] \cdot P(M = m \mid T = 1)

(21)

The availability of multiple estimators provides robustness checks and enables a comprehensive sensitivity analysis.
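To make Equations (19) and (20) concrete, the sketch below computes a stratified backdoor estimate and the covariance-ratio instrumental variable estimate using only the Python standard library. It assumes a binary treatment and a discrete adjustment variable $W$, and it does not call the DoWhy API; the function names and the toy data are hypothetical.

```python
from statistics import mean

def backdoor_ate(rows):
    """E_W[ E[Y | T=1, W] - E[Y | T=0, W] ] for rows of (y, t, w), t in {0, 1}."""
    strata = {}
    for y, t, w in rows:
        strata.setdefault(w, {0: [], 1: []})[t].append(y)
    n = len(rows)
    ate = 0.0
    for w, groups in strata.items():
        if not groups[0] or not groups[1]:
            raise ValueError(f"stratum {w!r} lacks treated or control units")
        weight = (len(groups[0]) + len(groups[1])) / n  # empirical P(W = w)
        ate += weight * (mean(groups[1]) - mean(groups[0]))
    return ate

def sample_cov(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def iv_estimate(y, t, z):
    """Cov(Y, Z) / Cov(T, Z) for an instrument Z."""
    return sample_cov(y, z) / sample_cov(t, z)

# Toy data: (outcome, treatment, confounder stratum).
rows = [(1, 0, 0), (1, 0, 0), (3, 1, 0), (2, 0, 1), (5, 1, 1), (5, 1, 1)]
tau_backdoor = backdoor_ate(rows)
tau_iv = iv_estimate([1, 3, 3, 5], [0, 1, 1, 2], [0, 0, 1, 1])
```

In the toy data the unadjusted difference in means is 3.0, whereas stratifying on $W$ gives 2.5, which illustrates why comparing estimators serves as a robustness check.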

Uncertainty quantification employs Bayesian neural networks to provide credible intervals for both predictions and causal effects. The posterior distribution over causal effects is expressed as

p(τ \mid X) = \int p(τ \mid X, θ) \, p(θ \mid X) \, dθ

(22)

where variational inference approximates the intractable posterior distribution $p(θ \mid X)$. This approach enables credible intervals for causal effect estimates, providing essential information for policy decision-making under uncertainty.

3.7. Data Preprocessing and Feature Selection

Our experimental framework employs rigorous data preprocessing to ensure temporal consistency and causal identifiability. Raw financial time series undergo standardization using rolling z-scores with a 252-day window to account for time-varying volatility patterns. Outlier detection employs Hampel filters with a 21-day median window and a 3-sigma threshold, replacing extreme values with median-based estimates to preserve the temporal continuity while removing data quality issues.
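The two preprocessing steps described above can be sketched as follows. The sketch assumes trailing windows that end at the current observation (so that no future values are used), a population standard deviation for the z-score, and the conventional 1.4826 scaling of the median absolute deviation for the Hampel threshold; none of these details is specified in the text, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev

def rolling_zscore(x, window=252):
    """Standardize x[t] using only observations in the trailing window."""
    z = []
    for t in range(len(x)):
        hist = x[max(0, t - window + 1): t + 1]
        sd = pstdev(hist)
        z.append((x[t] - mean(hist)) / sd if sd > 0 else 0.0)
    return z

def hampel_filter(x, window=21, n_sigmas=3.0):
    """Replace points far from the trailing-window median by that median."""
    scale = 1.4826  # makes the MAD comparable to a standard deviation under normality
    cleaned = list(x)
    for t in range(len(x)):
        hist = x[max(0, t - window + 1): t + 1]
        med = median(hist)
        mad = scale * median([abs(v - med) for v in hist])
        if mad > 0 and abs(x[t] - med) > n_sigmas * mad:
            cleaned[t] = med
    return cleaned

series = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 50.0, 1.1, 0.9]
cleaned = hampel_filter(series, window=7)
scores = rolling_zscore(cleaned, window=5)
```

Running the outlier filter before standardization keeps a single spike from inflating the rolling standard deviation of the observations that follow it.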

The feature selection criteria combine economic relevance with causal identifiability requirements. The primary variables include the Federal Funds Rate, the Treasury yields across maturities (3-month through to 30-year), credit spreads, and equity volatility measures. Policy communication features are derived from an FOMC statement text analysis using TF-IDF vectorization and sentiment scoring, capturing the effects of forward guidance and communication tone on the transmission mechanisms.

Temporal alignment procedures ensure consistency in the causal ordering across data sources. High-frequency intraday data are aggregated to daily frequency using volume-weighted averages, while policy announcement effects are captured through narrow event windows. Missing data handling employs forward-filling for weekends and holidays, with adaptive attention masking for genuine data gaps exceeding 5 consecutive observations.

3.8. DoWhy Integration within Transformer Blocks

The DoWhy integration operates through three interconnected mechanisms within transformer blocks, enabling real-time causal inference during forward passes. Algorithm 1 details the implementation framework.

Algorithm 1 DoWhy Integration within Transformer Blocks.

Initialize the causal graph $G = (V, E)$
for epoch $e = 1$ to E do
    for each transformer block b do
        Compute attention weights $A^{(b)} = Attention (Q, K, V)$
        Extract embeddings $H^{(b)} = A^{(b)} V$
        if $e \bmod 10 = 0$ then
            Update $G$ using the PC algorithm on $H^{(b)}$
            Compute the causal structure matrix $S^{(b)}$ from $G$
        end if
        Apply causal constraints: $A^{(b)} \leftarrow A^{(b)} + λ S^{(b)}$
    end for
    Estimate the treatment effects using DoWhy modules
end for

Causal graph learning operates every 10 epochs using the PC algorithm with conditional independence testing on transformer embeddings $H^{(b)}$. The learned graph structure generates causal constraint matrices $S^{(b)}$ that directly modify the attention computations through additive terms $λ S^{(b)}$. This integration ensures causal consistency without disrupting the gradient flow.

3.9. Scenario Generation and Stress Testing

The framework incorporates comprehensive scenario generation and stress testing capabilities essential for financial risk management and regulatory compliance. These capabilities build upon the causal structure learning to generate realistic policy scenarios and evaluate their potential impacts across different market conditions.

Policy surprise identification employs high-frequency identification strategies that isolate exogenous policy shocks from endogenous market responses. The identification strategy follows

Δr_{t} = α + β \cdot \text{Surprise}_{t} + γ \cdot X_{t-1} + ϵ_{t}

(23)

where policy surprises are identified through narrow event windows around policy announcements. This approach ensures that generated scenarios reflect genuine policy innovations rather than market expectations or systematic responses to economic conditions.

The interest rate risk (IRR) stress testing framework generates comprehensive scenarios through a structured approach that combines scenario generation with risk metric computation. Scenarios are generated as

\text{Scenario}_{k} = G(z_{k}, θ_{policy})

(24)

where $G$ represents the neural scenario generation function, and $z_{k}$ are random inputs. The risk metrics include Value-at-Risk and Expected Shortfall, computed as

\text{VaR}_{α} = \inf \{x : P(\text{Loss} \leq x) \geq α\}

(25)

\text{ES}_{α} = E[\text{Loss} \mid \text{Loss} > \text{VaR}_{α}]

(26)

These metrics are computed across multiple scenarios to provide a comprehensive risk assessment under different policy environments.
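Empirical counterparts of Equations (25) and (26) over a finite set of simulated losses can be sketched as follows. The sketch assumes equally weighted scenarios and falls back to the VaR itself when no loss exceeds it; the function names and the example losses are hypothetical.

```python
import math

def value_at_risk(losses, alpha=0.95):
    """Smallest loss x such that the empirical P(Loss <= x) is at least alpha."""
    ordered = sorted(losses)
    # Number of scenarios that must lie at or below x; the small offset guards
    # against floating-point round-up in alpha * n.
    k = max(1, math.ceil(alpha * len(ordered) - 1e-12))
    return ordered[k - 1]

def expected_shortfall(losses, alpha=0.95):
    """Mean loss over scenarios strictly worse than the VaR."""
    var = value_at_risk(losses, alpha)
    tail = [loss for loss in losses if loss > var]
    return sum(tail) / len(tail) if tail else var

scenario_losses = list(range(1, 101))  # hypothetical losses from 100 scenarios
var_95 = value_at_risk(scenario_losses, 0.95)
es_95 = expected_shortfall(scenario_losses, 0.95)
```

Expected Shortfall is never smaller than the Value-at-Risk at the same level, since it averages only the losses beyond that quantile.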

3.10. The Training Procedure and Optimization

The training procedure for CausalFormer requires careful orchestration of multiple objectives that balance predictive accuracy with causal validity and interpretability. The multi-objective optimization framework addresses the fundamental challenge of learning both accurate predictive models and valid causal structures from the same observational data.

The composite loss function integrates multiple objectives:

L_{total} = L_{prediction} + λ_{causal} L_{causal} + λ_{NS} L_{NS} + λ_{reg} L_{regularization}

(27)

The prediction loss $L_{prediction}$ measures the forecasting accuracy using appropriate metrics for financial time series, such as the mean absolute error for level predictions and directional accuracy for trend predictions. The causal loss $L_{causal}$ enforces consistency between the learned relationships and the identified causal structure, while the Nelson–Siegel loss $L_{NS}$ maintains factor interpretability, and the regularization loss $L_{regularization}$ prevents overfitting and ensures model stability.

The causal consistency loss ensures that learned relationships respect the identified causal structure through two components:

L_{causal} = \sum_{(i,j) \notin E} \max(0, |\text{Attention}_{i,j}| - ϵ) + \sum_{\text{cycles}} \| \text{cycle} \|^{2}

(28)

where $ϵ$ is a small tolerance parameter. The first component penalizes attention to non-causal relationships, while the second component prevents cycles in the learned graph, ensuring that the model maintains the acyclic property essential for causal interpretation.
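The first component of Equation (28) is a hinge penalty on attention mass placed outside the identified edge set, and it can be sketched as below. The sketch treats every index pair not in $E$, including the diagonal, as a non-edge, which is an assumption; the cycle term is omitted because its exact form is not specified in the text. The function name and the example matrix are hypothetical.

```python
def non_edge_attention_penalty(attention, edges, eps=0.01):
    """Sum of max(0, |a_ij| - eps) over all (i, j) that are not causal edges."""
    edge_set = set(edges)
    total = 0.0
    for i, row in enumerate(attention):
        for j, a in enumerate(row):
            if (i, j) not in edge_set:
                total += max(0.0, abs(a) - eps)
    return total

attention = [[0.0, 0.5], [0.005, 0.0]]
penalty_with_edge = non_edge_attention_penalty(attention, edges=[(0, 1)])
penalty_without_edge = non_edge_attention_penalty(attention, edges=[])
```

The tolerance $ϵ$ leaves small off-graph weights unpenalized, so only attention that clearly contradicts the graph contributes to the loss.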

The training strategy employs curriculum learning that gradually introduces causal constraints to improve the convergence stability. The causal regularization weight follows

λ_{causal}(t) = λ_{max} \cdot (1 - e^{-αt})

(29)

where $t$ represents the training iteration, and $α$ controls the rate of constraint introduction. This approach allows the model to first learn basic temporal patterns before enforcing strict causal constraints, preventing early training instability that can occur when multiple complex constraints are introduced simultaneously.
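The schedule in Equation (29) is simple enough to state directly in code. The default values below are placeholders chosen for illustration, and the function name is hypothetical.

```python
import math

def causal_weight(t, lam_max=1.0, alpha=0.1):
    """lambda_causal(t) = lam_max * (1 - exp(-alpha * t))."""
    return lam_max * (1.0 - math.exp(-alpha * t))

schedule = [causal_weight(t) for t in range(0, 51, 10)]
```

The weight starts at zero, rises fastest in the first few iterations, and saturates at $λ_{max}$; with $α = 0.1$ it has covered more than 99% of that range by $t = 50$.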

The complete training procedure integrates gradient-based optimization with causal discovery algorithms through an alternating optimization scheme. Parameter updates optimize the neural network components while maintaining the current causal structure, followed by graph structure refinement, which updates the causal graph based on the current parameter estimates. This alternation continues until the convergence criteria are satisfied for both the predictive performance and the stability of the causal structure, ensuring that the final model achieves both accurate predictions and valid causal interpretation.

4. Experimental Evaluation

This section presents comprehensive experimental validation of CausalFormer across multiple financial datasets and a comparison with established baseline methods. We evaluate both the predictive performance and causal inference capabilities through a rigorous empirical analysis.

4.1. The Experimental Setup

Our experimental framework addresses three core research questions: (1) Does CausalFormer achieve a superior predictive accuracy compared to that of existing time series forecasting methods? (2) Can the framework reliably identify and quantify causal relationships in financial data? (3) How effectively does the model maintain causal consistency while preserving interpretability? The evaluation employs multiple real-world datasets spanning different financial markets and policy regimes to ensure robustness across various economic conditions.

4.2. Data Description and Preprocessing

4.2.1. Dataset Specifications

Our primary monetary policy dataset comprises 6174 daily observations spanning 3 January 2000 to 29 December 2023, encompassing 47 variables, including the Federal Funds Rate, Treasury yields across 11 maturities (3-month, 6-month, 1-year, 2-year, 3-year, 5-year, 7-year, 10-year, 20-year, and 30-year), credit spreads (investment-grade, high-yield), equity volatility measures (VIX, MOVE), and policy communication indicators derived from FOMC statements. The European dataset contains 5892 observations with 32 variables, including ECB key rates, Euro area government bond yields, and policy communication metrics. The high-frequency policy surprise dataset encompasses 193 FOMC announcement events with intraday pricing data captured in 5-min intervals around policy announcements.

Table 1 presents comprehensive descriptive statistics for key variables, demonstrating the substantial variation captured across multiple monetary policy cycles and crisis periods.

4.2.2. Preprocessing Procedures

Missing value handling employed a systematic approach differentiated by data type. Standard market closures (weekends and holidays) affecting 8.3% of observations were addressed through forward-filling to maintain temporal continuity. Genuine data gaps exceeding five consecutive trading days, comprising 0.7% of the observations, were treated using linear interpolation combined with attention masking to prevent spurious pattern learning during neural network training.

Outlier detection and treatment utilized modified Hampel filters with 21-day rolling median windows and 3-sigma thresholds, calibrated to the volatility characteristics of financial time series. This procedure identified and adjusted 1.2% of observations, replacing extreme values with robust estimates while preserving temporal ordering. Policy surprise outliers beyond 4-sigma thresholds were retained, as they represented genuine exogenous shocks essential for causal identification.

Standardization employed rolling z-scores with 252-day windows to accommodate the time-varying volatility patterns characteristic of financial markets. This approach prevents look-ahead bias while ensuring the model inputs remain stationary across different market regimes.

4.2.3. Variable Selection and Justification

The variable selection combined theoretical foundations, statistical validation, and causal identifiability requirements. The primary selection criteria included (1) economic relevance based on established monetary transmission theory, (2) statistical significance through Granger causality tests with a 5% significance threshold, (3) causal identifiability satisfaction of the backdoor criterion for treatment effect estimation, and (4) a data quality assessment ensuring sufficient observation coverage and temporal consistency.

Policy communication variables were constructed through TF-IDF vectorization of the FOMC statement text, extracting 50-dimensional semantic features capturing forward guidance tone and policy uncertainty measures. Sentiment scores were computed using financial domain-specific lexicons, validated against market-based policy uncertainty indices.

4.2.4. The Data Splitting Strategy

Temporal data splitting maintained the chronological ordering essential for causal inference validation. Training data encompasses 60% of observations (January 2000–December 2014, N = 3705), validation data covers 20% (January 2015–December 2018, N = 1235), and testing data comprises 20% (January 2019–December 2023, N = 1234). This allocation ensures the representation of major crisis periods across all splits while providing a sufficient out-of-sample evaluation during recent policy regimes, including unconventional monetary policy and pandemic response measures.

Cross-validation employed an expanding window methodology with a minimum of 2000 observations for initial training, incrementally adding 250-observation blocks for robust hyperparameter optimization. This approach respects temporal dependencies while providing a reliable model selection across varying market conditions.
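An expanding window scheme of this kind can be sketched as follows. The sketch assumes that each fold validates on the 250-observation block immediately following the current training window and that incomplete trailing blocks are dropped; applying it to the 3705 training observations is an illustrative choice, and the function name is hypothetical.

```python
def expanding_window_splits(n_obs, min_train=2000, block=250):
    """Return (train_range, validation_range) pairs in chronological order."""
    splits = []
    train_end = min_train
    while train_end + block <= n_obs:
        splits.append((range(0, train_end), range(train_end, train_end + block)))
        train_end += block  # the validated block joins the next training window
    return splits

splits = expanding_window_splits(3705)
```

Every validation block lies strictly after its training window, so no fold uses future observations to select hyperparameters.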

4.3. Baseline Methods

We compare against established econometric approaches, including VAR models [4] with the lag orders selected via the Bayesian Information Criterion. SVAR models employ long-run restrictions following [5] to identify monetary policy shocks. The DNS model [49] serves as the primary yield curve modeling baseline, estimated via Kalman filtering with maximum likelihood parameter estimation.

The neural network baselines include LSTM networks [61] with attention mechanisms, standard transformer models [6] adapted for time series forecasting, and TFT [57]. The Neural Basis Expansion Analysis for Time Series (N-BEATS) [62] provides a specialized deep learning baseline for univariate forecasting tasks.

For the comparison of causal effect estimation, we implement the Double Machine Learning (DML) framework [29] with random forests and gradient boosting as the base learners. The Structural Agnostic Model (SAM) [63] provides automated causal discovery capabilities, while the PC algorithm [25] serves as a constraint-based causal discovery baseline.

4.4. The Evaluation Metrics

Forecasting accuracy is assessed through multiple horizons (1-day, 1-week, 1-month, 1-quarter) using the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). Directional accuracy measures the percentage of correctly predicted trend directions. The Diebold–Mariano test [64] evaluates the statistical significance of forecasting improvements.

Causal inference performance is measured using the Average Treatment Effect (ATE) bias, defined as the absolute difference between the estimated and true treatment effects. The Root Mean Squared Error of the treatment effects (RMSE-TE) quantifies estimation precision. The coverage probability of the confidence intervals assesses the quality of uncertainty quantification, with nominal coverage set at 95%.

Structural learning evaluation uses the Structural Hamming Distance (SHD) between the true and estimated causal graphs, the precision and recall of edge detection, and the F1-score combining both measures. The Area Under the Receiver Operating Characteristic curve (AUROC) evaluates the edge existence classification performance.
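The edge-level metrics can be computed from two sets of directed edges, as in the sketch below. It assumes the convention in which a reversed edge contributes one unit to the SHD (some definitions count it as two), and the function names and example graphs are hypothetical.

```python
def edge_metrics(true_edges, est_edges):
    """Precision, recall, and F1 for directed edge detection."""
    true_set, est_set = set(true_edges), set(est_edges)
    tp = len(true_set & est_set)
    precision = tp / len(est_set) if est_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def structural_hamming_distance(true_edges, est_edges):
    """Count node pairs whose edge status differs (missing, extra, or reversed)."""
    differing = set(true_edges) ^ set(est_edges)
    # A reversal puts both (i, j) and (j, i) in the difference; collapsing to
    # unordered pairs counts it once.
    return len({frozenset(edge) for edge in differing})

true_graph = [(0, 1), (1, 2), (2, 3)]
estimated = [(0, 1), (2, 1), (0, 3)]
precision, recall, f1 = edge_metrics(true_graph, estimated)
shd = structural_hamming_distance(true_graph, estimated)
```

As a check on the definition, the F1-score is the harmonic mean of precision and recall, so a precision of 87.4% and a recall of 82.1% correspond to an F1-score of about 84.7%.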

We introduce a novel causal consistency score measuring the proportion of the attention weights that respect identified causal relationships:

\text{Consistency} = \frac{1}{|E|} \sum_{(i,j) \in E} \mathbb{I}[\text{Attention}_{i,j} > τ] + \frac{1}{|\bar{E}|} \sum_{(i,j) \in \bar{E}} \mathbb{I}[\text{Attention}_{i,j} \leq τ]

(30)

where $E$ represents true causal edges, $\bar{E}$ represents non-edges, and $τ$ is a threshold parameter.
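A literal transcription of Equation (30) is sketched below. Because the equation as printed adds two proportions, this transcription returns a value between 0 and 2; how that sum is mapped to the percentage scores reported later is not stated in the text, so no rescaling is applied here. The sketch also assumes that all index pairs outside $E$, including the diagonal, count as non-edges. The function name and the example matrix are hypothetical.

```python
def consistency_score(attention, edges, tau):
    """Share of edges with attention > tau plus share of non-edges with attention <= tau."""
    n = len(attention)
    edge_set = set(edges)
    non_edges = [(i, j) for i in range(n) for j in range(n) if (i, j) not in edge_set]
    edge_hits = sum(attention[i][j] > tau for i, j in edge_set) / len(edge_set)
    non_edge_hits = sum(attention[i][j] <= tau for i, j in non_edges) / len(non_edges)
    return edge_hits + non_edge_hits

attention = [[0.1, 0.8], [0.2, 0.1]]
score = consistency_score(attention, edges=[(0, 1)], tau=0.5)
```

Normalizing each term by its own set size keeps the much larger set of non-edges from dominating the score.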

4.5. Implementation Details

The CausalFormer implementation uses PyTorch 1.12.0 with an 8-layer transformer architecture, 512-dimensional embeddings, and 8 attention heads. The multi-kernel convolution module employs kernel sizes $\{3, 5, 7, 9, 11\}$ with dilation rates $\{1, 2, 4, 8, 16\}$. Nelson–Siegel factors use 2-layer MLPs with 256 hidden units each. Training employs the Adam optimizer with a learning rate of $10^{-4}$, a batch size of 64, and gradient clipping at norm 1.0.

The curriculum learning schedule increases the causal regularization weight $λ_{causal}$ from 0.1 to 1.0 over 50 epochs using exponential scheduling with $α = 0.1$. Early stopping monitors the validation loss with a patience of 10 epochs. Hyperparameter tuning uses Bayesian optimization over 100 trials with 5-fold cross-validation.

DoWhy integration employs the linear regression estimator for backdoor adjustment, two-stage least squares for instrumental variables, and the difference-in-differences estimator for temporal treatments. Causal discovery uses the PC algorithm with conditional independence testing via partial correlation with a significance level $α = 0.05$.

4.6. Results

4.6.1. Enhanced Nelson–Siegel Model Selection

The selection of the Nelson–Siegel model over alternative yield curve models was based on a comprehensive empirical comparison and theoretical considerations. Table 2 presents a performance comparison across candidate models.

The Nelson–Siegel model provides the optimal balance across evaluation dimensions. Economic interpretability through level, slope, and curvature factors enables a transparent policy analysis, while parsimonious parameterization (three factors) facilitates neural network integration without dimensionality issues. The model’s established theoretical foundations support causal factor relationships essential for our framework.

Empirical validation demonstrates the Nelson–Siegel model’s superior regime stability. During the 2008 financial crisis and 2020 pandemic periods, the factor loadings remained stable (coefficient variation < 15%) while cubic splines exhibited parameter instability and PCA factors required re-estimation. Affine term structure models, while theoretically appealing, proved computationally intensive and showed poor convergence in neural optimization.

The enhanced Nelson–Siegel formulation maintains these advantages while adding neural flexibility through factor evolution functions. This hybrid approach achieves an 18% lower RMSE than the standard Nelson–Siegel model while preserving interpretability, making it optimal for causal analysis in transformer architectures.

4.6.2. Predictive Performance Analysis

Table 3 presents the forecasting performance across all datasets and horizons, with comprehensive statistical significance testing confirming the robustness of the observed improvements. CausalFormer achieves a statistically significant superior performance in all 24 evaluation scenarios, with particularly strong and significant improvements in medium-term forecasting (1-week to 1-month horizons). For Federal Funds Rate predictions, CausalFormer reduces the MAE by 12.7% compared to that of the best baseline (TFT), with statistical significance at the 1% level (p = 0.0023), while improving the directional accuracy by 8.3 percentage points with a 95% confidence interval [5.7%, 11.2%].

The yield curve forecasting results demonstrate CausalFormer’s effectiveness in capturing the term structure dynamics. The model achieves a 15.8% lower RMSE compared to that of DNS models for the 10-year Treasury yields and maintains a superior performance across all maturities. The enhanced Nelson–Siegel decomposition successfully preserves the factor interpretability while improving the predictive accuracy.

4.6.3. Causal Effect Estimation Results

The causal inference evaluation focuses on monetary policy transmission effects using policy surprise instruments. Table 4 shows the treatment effect estimation performance for various policy interventions. CausalFormer achieves a 15.3% lower ATE bias compared to that of the DML baselines and maintains a 93.2% coverage probability for confidence intervals, indicating well-calibrated uncertainty quantification.

The policy transmission analysis reveals heterogeneous effects across different instruments. Conventional interest rate changes exhibit immediate effects, with a 0.8 basis point yield curve impact per 25 basis point policy change. Quantitative easing programs show delayed but persistent effects, with the peak impact occurring 3–4 weeks after announcements. Forward guidance demonstrates an intermediate transmission speed with significant uncertainty around the effect magnitude.

4.6.4. The Causal Structure Learning Performance

The structural learning evaluation examines CausalFormer’s ability to recover true causal relationships in financial data. Table 5 presents a comprehensive comparison of the structural learning performance across methods. The model achieves an SHD of 2.3 compared to 4.7 for the PC algorithm baseline, indicating superior structural recovery. Precision reaches 87.4% with a recall of 82.1%, yielding an F1-score of 84.7%. The AUROC for edge detection is 0.923, demonstrating a strong discriminative capability.

The cross-validation analysis confirms robustness across different market regimes. The performance remains stable during crisis periods (2008–2009, 2020), though with slightly elevated uncertainty intervals. The model successfully identifies regime-specific transmission mechanisms, such as enhanced bank lending channel effects during quantitative easing periods.

4.6.5. The Causal Consistency Analysis

The causal consistency score reaches 91.2% across all experimental settings, confirming that the attention patterns align with the identified causal structure. Figure 3 visualizes the attention weight distributions across different financial variable relationships, demonstrating clear alignment with established financial theory.

The observed attention patterns provide strong evidence for economic interpretability beyond statistical validation. The strong attention weights from short-term rates to long-term yields (Fed Funds to 10-year Treasury: 0.84 average weight) directly reflect the expectations hypothesis, where long-term rates incorporate market expectations of future short-term rate movements. This relationship strengthens during periods of policy uncertainty, with the attention weights increasing to 0.91+ during crisis episodes, capturing the elevated term premium sensitivity to policy signals predicted by affine term structure models.

The term premium dynamic analysis reveals regime-dependent attention patterns that align with the theoretical predictions. During normal market conditions, the attention weights between distant maturities remain modest (typically <0.3), consistent with segmented markets theory, where investor preferences create natural habitat effects. However, during stress periods, the cross-maturity attention strengthens significantly, supporting preferred habitat theory modifications that account for flight-to-quality dynamics and increased cross-market arbitrage activity.

The time-varying nature of the attention weights captures dynamic relationships between yield curve factors consistent with term structure theory. Level factor attention dominates during conventional policy periods, while the slope and curvature factors receive increased attention during quantitative easing episodes, reflecting the differential impact of unconventional policies on the shape of the yield curve. These theoretically grounded patterns validate that our attention mechanism learns economically meaningful relationships rather than spurious statistical correlations.

Ablation studies demonstrate that removing causal constraints reduces the consistency to 73.4% while maintaining a similar predictive performance, highlighting the framework’s ability to encode economic theory without sacrificing accuracy. Attention visualization confirms economically interpretable patterns where the policy communication variables exhibit time-varying attention patterns, with increased focus during periods of unconventional monetary policy, consistent with enhanced policy uncertainty and communication importance during non-standard policy regimes.

4.6.6. Robustness and Sensitivity Analysis

Sensitivity analysis examines the performance across different hyperparameter configurations and dataset characteristics. We conducted comprehensive parameter sweeps to identify the most critical hyperparameters and provide practitioners with clear guidance on the model configuration.

Parameter sensitivity ranking revealed that model performance is most sensitive to the causal regularization weight $λ_{causal}$. Table 6 presents a detailed sensitivity analysis across key hyperparameters. The causal regularization weight shows rapid performance degradation outside the optimal range [0.5, 2.0], with the performance dropping by 18.4% at $λ_{causal} = 0.1$ and 15.7% at $λ_{causal} = 5.0$. This high sensitivity stems from the fundamental tension between expressiveness and interpretability in our architecture, where insufficient constraint enforcement ($λ_{causal} < 0.5$) leads to the attention patterns violating temporal precedence, while over-constraining ($λ_{causal} > 2.0$) reduces the model’s ability to capture complex temporal dependencies.

The Nelson–Siegel factor embedding dimension emerged as the second most sensitive parameter, with the optimal performance achieved at 256 dimensions. This sensitivity reflects the fundamental role these factors play in yield curve representation: insufficient dimensionality (≤128) fails to capture complex term structure dynamics, while excessive dimensionality (≥512) introduces overfitting into the factor evolution functions. The performance degradation reaches 12.3% when the dimensionality drops to 64, confirming the critical nature of proper factor representation.

The multi-kernel convolution dilation rates showed moderate sensitivity, with the geometric progression [1, 2, 4, 8, 16] proving optimal for capturing multi-scale policy transmission effects. Linear progression patterns reduced the performance by 8.7%, demonstrating that exponential receptive field expansion is essential for modeling the hierarchical nature of financial policy transmissions across different time horizons.
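
The difference between geometric and linear dilation schedules can be illustrated with a minimal sketch. It assumes a kernel size of 3 and a hypothetical linear schedule [1, 2, 3, 4, 5]; neither value is taken from the article, and the helper names `receptive_field` and `causal_dilated_conv` are illustrative only. The convolution looks strictly backwards in time, so no output depends on future inputs.

```python
def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of time steps spanned by one causal dilated kernel."""
    return 1 + (kernel_size - 1) * dilation


def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_j weights[j] * x[t - j * dilation].

    Indices before the start of the series are treated as zero,
    so y[t] never depends on x[s] with s > t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


KERNEL_SIZE = 3                 # assumed for illustration
geometric = [1, 2, 4, 8, 16]    # schedule reported in the text
linear = [1, 2, 3, 4, 5]        # hypothetical linear alternative

geo_fields = [receptive_field(KERNEL_SIZE, d) for d in geometric]
lin_fields = [receptive_field(KERNEL_SIZE, d) for d in linear]
print(geo_fields)  # [3, 5, 9, 17, 33]
print(lin_fields)  # [3, 5, 7, 9, 11]

# A unit impulse at t = 5 only affects outputs at t >= 5.
impulse = [0.0] * 10
impulse[5] = 1.0
response = causal_dilated_conv(impulse, [1.0, 1.0, 1.0], dilation=2)
```

Under these assumptions the widest geometric kernel spans 33 time steps against 11 for the linear schedule, which is one way to read the reported advantage of exponential receptive field expansion.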

Table 7 demonstrates stability across various experimental conditions beyond hyperparameter sensitivity. The results remain stable across the identified optimal hyperparameter ranges, with performance degradation only at extreme values. The model maintains effectiveness with 50% missing data through adaptive attention masking, confirming robustness for real-world deployment scenarios.
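
The adaptive attention masking used for missing data is not specified in this section. As a rough sketch of the general idea, the function below restricts each row of a softmax to positions that are both observed and not in the future. The function name `masked_causal_softmax` and the `observed` flag list are assumptions made for illustration, not the authors' implementation.

```python
import math


def masked_causal_softmax(scores, observed):
    """Row-wise softmax restricted to observed, non-future positions.

    scores   -- square list of lists of raw attention scores
    observed -- list of booleans, False where the observation is missing
    Rows with no admissible position are left as all zeros.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        allowed = [j for j in range(i + 1) if observed[j]]
        row = [0.0] * n
        if allowed:
            peak = max(scores[i][j] for j in allowed)  # for numerical stability
            exps = {j: math.exp(scores[i][j] - peak) for j in allowed}
            total = sum(exps.values())
            for j in allowed:
                row[j] = exps[j] / total
        weights.append(row)
    return weights


# Hypothetical example: four time steps, the second one is missing.
scores = [[0.0] * 4 for _ in range(4)]
observed = [True, False, True, True]
attn = masked_causal_softmax(scores, observed)
```

With equal scores, the last row spreads its weight evenly over the three observed positions and assigns zero weight to the missing one, so attention mass is renormalized over the data that is actually available.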

The theoretical analysis indicates that the high sensitivity to λ_{causal} arises because the causal constraints operate as a regularizer that must be carefully balanced to maintain both the predictive power and economic validity. The regularization term λ_{causal} L_{causal} in our composite loss function directly controls the trade-off between data-driven attention patterns and theoretically motivated causal relationships. Practitioners should therefore tune λ_{causal} according to the balance between predictive accuracy and causal interpretability that their application domain requires.
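
To make the role of the weight concrete, the sketch below combines a prediction loss with a causal penalty scaled by λ_{causal}. The penalty used here, the average attention mass placed on future positions, is an assumed stand-in for L_{causal}; the article's exact definition is given in its method section and may differ. The names `causal_violation_penalty` and `composite_loss` are hypothetical.

```python
def causal_violation_penalty(attention):
    """Average attention mass assigned to future positions (column > row).

    This is an illustrative stand-in for L_causal, not the article's definition.
    """
    n = len(attention)
    future_mass = sum(
        attention[i][j] for i in range(n) for j in range(i + 1, n)
    )
    return future_mass / n


def composite_loss(pred_loss, attention, lambda_causal):
    """Total loss = prediction loss + lambda_causal * causal penalty."""
    return pred_loss + lambda_causal * causal_violation_penalty(attention)


# Hypothetical attention matrices (each row sums to 1).
leaky = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
]
causal = [
    [1.0, 0.0, 0.0],
    [0.4, 0.6, 0.0],
    [0.2, 0.3, 0.5],
]

for lam in (0.1, 1.0, 5.0):
    print(lam, composite_loss(1.0, leaky, lam), composite_loss(1.0, causal, lam))
```

In this toy setting a lower-triangular attention matrix incurs no penalty at any weight, while the leaky matrix is penalized in proportion to λ_{causal}: a small weight barely discourages violations of temporal precedence, and a large weight lets the penalty dominate the prediction loss.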

Out-of-sample evaluation using 2024 data (the post-training period) confirms the generalization capability within the identified optimal parameter ranges. The performance degrades modestly but remains superior to that of the baselines, with particular robustness in causal effect estimation. This suggests that learned causal structures capture fundamental economic relationships rather than spurious correlations when the parameters are properly configured.

The computational efficiency analysis shows that CausalFormer requires 2.3× the training time of standard transformers but achieves 40% faster inference due to optimized attention computation. The memory requirements scale linearly with the sequence length, enabling application to extended time series without prohibitive computational costs when the hyperparameters are set within the identified optimal ranges.

4.6.7. Economic Interpretation and Policy Implications

The estimated causal effects align with established monetary transmission theory while revealing novel insights about the asymmetric transmission mechanisms. Figure 4 illustrates the temporal dynamics of policy transmission across different instruments, demonstrating three distinct transmission mechanisms with varying speeds and persistence patterns. The model identifies a previously undocumented asymmetric transmission channel where policy tightening exhibits 40% faster propagation than easing, providing empirical support for theoretical predictions in the monetary transmission literature.

Our asymmetric transmission findings align closely with the established empirical literature on monetary policy channels. The 40% faster tightening propagation supports the theoretical predictions of Bernanke and Gertler [65] regarding the bank lending channel, where credit constraints bind more quickly during tightening than they relax during easing. This asymmetry reflects the “pushing on a string” phenomenon, where monetary tightening immediately constrains the lending capacity, while easing requires time for banks to rebuild lending relationships and borrower confidence.

A comparison with existing empirical studies reveals that CausalFormer identifies stronger asymmetric effects than traditional approaches. The narrative approach of Romer and Romer [31] and the proxy SVAR studies of Coibion [66] found directional asymmetries of 15–25% faster tightening, smaller than our 40% estimate. Our enhanced temporal resolution and causal identification framework capture transmission speed differences that VAR approaches may underestimate due to temporal aggregation and identification limitations. The framework’s ability to isolate purely exogenous policy variations through high-frequency surprise identification enables the detection of transmission asymmetries obscured in lower-frequency analyses.

The identified asymmetries extend beyond cyclical effects to capture structural transmission differences. Tenreyro and Thwaites [67] documented state-dependent transmissions varying with the business cycle position, while our results persist across interest rate regimes, suggesting fundamental rather than cyclical asymmetries. The transmission speed differences remain significant during both normal and crisis periods, indicating structural features of financial intermediation rather than temporary market conditions.

Our findings contribute to the literature on the zero lower bound by demonstrating that asymmetric transmission mechanisms operate above the lower bound. Eggertsson and Woodford’s theoretical work [68] on lower bound constraints predicted asymmetric policy effectiveness, while our empirical evidence shows that transmission speed asymmetries persist across the entire policy rate spectrum. The financial friction channels identified by Mishkin [69] provide a theoretical foundation for our empirical results, where information asymmetries and agency costs create differential transmission speeds for tightening versus easing policies.

Credit spread reactions show nonlinear dependence on the magnitude of policy surprises, suggesting threshold effects in risk premium adjustments consistent with financial accelerator mechanisms. The framework successfully captures these nonlinear relationships while maintaining economic interpretability, addressing the limitations of linear VAR approaches that assume symmetric transmissions across policy directions and magnitudes.

The impulse response functions displayed in Figure 4 were constructed using our causal framework through a policy surprise identification strategy, where exogenous policy shocks are isolated from endogenous market responses using narrow event windows around FOMC announcements. The confidence intervals were computed through our variational inference framework, providing credible uncertainty quantification for each transmission mechanism. Conventional rate changes exhibit an immediate market response, with the effects materializing within hours and reaching the peak impact within 2–3 days, reflecting efficient market processing of clear policy signals. Quantitative easing programs demonstrate delayed but highly persistent effects, with the impacts building gradually over 2–3 weeks before reaching the maximum transmission after 3–4 weeks, consistent with portfolio rebalancing dynamics as institutions adjust the holdings across asset classes. Forward guidance shows intermediate transmission characteristics, with the effects emerging over 5–10 days as the market participants gradually incorporate policy expectations into yield curve pricing, reflecting the inherent uncertainty in interpreting future policy commitments.
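
The event-window idea behind the policy surprise identification can be sketched as follows. The window of 10 minutes before and 20 minutes after the announcement, the quote data, and the function names are all assumptions chosen for illustration; the article states only that narrow windows around FOMC announcements are used.

```python
def last_quote_at_or_before(times, values, t):
    """Return the most recent value quoted at or before time t.

    times must be sorted in ascending order.
    """
    candidates = [v for s, v in zip(times, values) if s <= t]
    if not candidates:
        raise ValueError("no quote available at or before the requested time")
    return candidates[-1]


def policy_surprise(times, yields_bp, announcement, before=10, after=20):
    """Yield change (basis points) across a narrow window around an announcement.

    before / after are window half-widths in minutes (assumed values).
    """
    pre = last_quote_at_or_before(times, yields_bp, announcement - before)
    post = last_quote_at_or_before(times, yields_bp, announcement + after)
    return post - pre


# Hypothetical intraday quotes: minutes since market open, yield in basis points.
times = [0, 30, 50, 55, 65, 80, 120]
yields_bp = [400.0, 401.0, 401.0, 401.5, 409.0, 408.0, 412.0]

surprise = policy_surprise(times, yields_bp, announcement=60)
print(surprise)  # 7.0
```

Only the move inside the window counts as the surprise. The later drift to 412 basis points is excluded, which is what separates the exogenous shock from subsequent endogenous market responses in this kind of design.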

A scenario analysis using the stress testing framework generates realistic interest rate paths for regulatory compliance. The Value-at-Risk estimates achieve a 96.8% backtesting accuracy over held-out periods, meeting the Basel III requirements for internal model validation. Expected Shortfall calculations provide conservative risk estimates suitable for capital adequacy assessment.
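
The article does not define how backtesting accuracy is computed. One common way to backtest Value-at-Risk forecasts, shown here purely as an illustrative sketch, is to count the days on which the realized loss exceeds the forecast and to compare the breach rate with the nominal level using the Kupiec proportion-of-failures likelihood ratio. The data and the 99% confidence level below are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention that the result is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(n, breaches, p):
    """Kupiec proportion-of-failures likelihood ratio statistic.

    n        -- number of backtest days
    breaches -- number of days with loss > VaR
    p        -- nominal breach probability (1 - confidence level)
    """
    p_hat = breaches / n
    log_null = _xlogy(n - breaches, 1.0 - p) + _xlogy(breaches, p)
    log_alt = _xlogy(n - breaches, 1.0 - p_hat) + _xlogy(breaches, p_hat)
    return -2.0 * (log_null - log_alt)


def var_backtest(losses, var_forecasts, confidence=0.99):
    """Return (breach count, empirical coverage, Kupiec statistic)."""
    n = len(losses)
    breaches = sum(1 for loss, var in zip(losses, var_forecasts) if loss > var)
    coverage = 1.0 - breaches / n
    return breaches, coverage, kupiec_pof(n, breaches, 1.0 - confidence)


# Hypothetical backtest: 100 days, constant VaR of 2.0, one loss above it.
losses = [1.0] * 99 + [3.0]
var_forecasts = [2.0] * 100
breaches, coverage, lr_stat = var_backtest(losses, var_forecasts)
```

Under the null hypothesis of correct coverage the statistic is asymptotically chi-squared with one degree of freedom, so values well above roughly 3.84 would indicate a miscalibrated model at the 5% level.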

The framework’s ability to maintain both predictive accuracy and causal interpretability addresses a fundamental challenge in financial econometrics. Policy makers can leverage both precise forecasts and transparent causal mechanisms for evidence-based decision-making, while financial institutions benefit from interpretable risk models that satisfy the regulatory requirements for model explainability.

4.7. Computational Complexity Analysis

We conducted a comprehensive computational complexity analysis comparing CausalFormer against the baseline methods across the training and inference phases. CausalFormer exhibits O(T^{2} d + T d^{2}) training complexity, where the T^{2} d term arises from the causal attention computations and the T d^{2} term arises from the multi-kernel convolutions. This compares to O(T^{2} d) complexity for standard transformers, showing a modest overhead from the causal constraints.
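
The size of the extra T d^{2} term relative to the T^{2} d term can be checked with a short calculation. The sketch assumes that T denotes the sequence length and d the hidden dimension, and it ignores constant factors, the number of kernels, and the number of layers, so the ratios are leading-order indications only. The values of T and d below are hypothetical.

```python
def leading_order_cost(T, d):
    """Leading-order operation counts for the two terms in O(T^2 d + T d^2)."""
    attention = T * T * d      # causal attention term
    convolution = T * d * d    # multi-kernel convolution term
    return attention, convolution


def relative_overhead(T, d):
    """Ratio (T^2 d + T d^2) / (T^2 d), which simplifies to 1 + d / T."""
    attention, convolution = leading_order_cost(T, d)
    return (attention + convolution) / attention


for T in (128, 512, 2048):
    print(T, relative_overhead(T, d=256))
```

Because the ratio reduces to 1 + d/T, the relative overhead shrinks as the sequence grows: with d = 256 it is 1.5 at T = 512 and 1.125 at T = 2048, while for sequences shorter than d the T d^{2} term dominates.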

Table 8 presents detailed timing benchmarks across different sequence lengths and hardware configurations. Training time measurements demonstrate that CausalFormer requires 2.3× longer training than vanilla transformers but achieves 40% faster inference through optimized attention computation and parallel processing of the convolution and Nelson–Siegel modules.

5. Discussion

The results presented in this study have profound implications that extend beyond the immediate technical contributions, fundamentally challenging conventional assumptions in financial econometrics and opening up new research directions in computational finance. This section explores the broader significance of our findings and their potential impact on both academic research and practical applications.

5.1. Theoretical Implications

Our demonstration that causal consistency and predictive accuracy are mutually reinforcing rather than competing objectives represents a paradigm shift in financial modeling. The traditional approaches have long assumed an inherent trade-off between interpretability and performance, leading to the artificial separation of econometric and machine learning methodologies. CausalFormer’s simultaneous achievement of a superior predictive performance and causal validity suggests that incorporating domain knowledge through causal constraints actually enhances rather than limits the model’s expressiveness. This finding has significant implications for the broader field of explainable AI in finance, where regulatory demands for model transparency have often been viewed as impediments to technological advancement.

The mathematical framework we have developed for exploiting architectural symmetries while preserving asymmetric temporal relationships establishes general principles applicable beyond financial applications. The symmetry–asymmetry duality inherent in many complex systems suggests that our approach could benefit domains ranging from climate modeling to biological system analysis. The demonstration that transformer attention mechanisms can be systematically constrained without sacrificing the computational efficiency provides a blueprint for incorporating scientific principles into deep learning architectures.

5.2. Practical Implementation and Regulatory Considerations

The practical deployment of CausalFormer in financial institutions requires careful consideration of the computational infrastructure and regulatory compliance requirements. Our analysis demonstrates that the framework’s 2.3× training time overhead is offset by 40% faster inference and superior interpretability, making it economically viable for production environments. The linear memory scaling with sequence length enables its application to extended time series without prohibitive computational costs, addressing a key concern for high-frequency financial applications.

Regulatory compliance under Basel III and similar frameworks requires models to demonstrate both statistical validity and economic interpretability. CausalFormer’s explicit causal structure learning and uncertainty quantification capabilities address these requirements directly. The framework’s ability to generate realistic stress scenarios with quantified uncertainty bounds meets internal model validation standards, while the interpretable factor decomposition satisfies supervisory review requirements for model explainability.

Integration with the existing risk management systems presents both opportunities and challenges. The framework’s modular architecture allows for gradual adoption, where individual components can be integrated with legacy systems before full-scale deployment. The standardized output format for causal effect estimates facilitates integration with downstream applications such as portfolio optimization and regulatory reporting systems.

5.3. Limitations and Future Research Directions

Despite its contributions, our framework has several limitations that suggest promising research directions. The current implementation focuses on developed market monetary policy where the data quality is high and institutional structures are well established. Extension to emerging markets where structural breaks are more frequent and the data availability is limited presents both technical and methodological challenges.

A multi-country policy coordination analysis represents a natural extension where the causal graph structure could capture cross-border spillover effects and policy interdependencies. The framework’s ability to learn time-varying causal structures makes it well suited to analyzing how global financial integration affects the policy transmission mechanisms across different jurisdictions.

Integration with alternative data sources, particularly textual policy communications and social media sentiment, could enhance the framework’s ability to capture expectation formation dynamics. Natural language processing techniques combined with our causal inference framework could provide unprecedented insights into how policy communication affects market behavior through causal channels beyond the traditional quantitative measures.

The emergence of cryptocurrency and digital asset markets presents novel challenges where traditional econometric approaches may be inadequate due to the absence of established institutional structures and theoretical frameworks. CausalFormer’s data-driven causal discovery capabilities combined with domain-agnostic architecture design make it particularly suited to analyzing these emerging financial ecosystems where causal relationships must be learned directly from data.

Future research could also explore extensions to higher-dimensional financial systems where network effects and systemic risk propagation mechanisms operate through complex causal pathways. The integration of graph neural networks with our causal attention mechanisms could enable an analysis of financial contagion and systemic risk transmission at an unprecedented scale and granularity.

6. Conclusions

This paper establishes a novel mathematical framework that systematically exploits the symmetry properties of transformer architectures while preserving the essential asymmetric temporal relationships in financial causal inference. CausalFormer represents a fundamental advancement in understanding how architectural symmetries can be leveraged for asymmetric pattern discovery in complex temporal systems. Our framework demonstrates that the inherent permutation equivariance of self-attention mechanisms, when combined with carefully designed asymmetric temporal constraints, creates a powerful mathematical foundation for causal discovery in financial time series. Extensive experimental evaluation across multiple real-world datasets demonstrates that CausalFormer achieves a superior performance with a 15.3% improvement in the accuracy of causal effect estimations and a 12.7% enhancement in the predictive performance compared to existing methods while maintaining 91.2% causal consistency scores across policy transmission scenarios. The framework successfully identifies complex monetary policy transmission mechanisms, including previously undocumented asymmetric channels where policy tightening exhibits faster propagation than easing, while providing robust uncertainty quantification essential for regulatory compliance and risk management applications.

The broader implications extend beyond financial applications to any domain where symmetric computational structures must accommodate asymmetric temporal relationships. The mathematical framework establishes general principles for designing neural architectures that respect both the symmetric properties essential for computational efficiency and the asymmetric constraints required for domain-specific validity. This dual preservation of symmetry and asymmetry properties opens up new directions for mathematically principled deep learning across scientific computing applications where temporal precedence and causal ordering are fundamental.

An important avenue for future research involves integrating different concepts of asymmetry within financial markets. Our framework addresses temporal asymmetries in policy transmission speeds, while GARCH-type models capture asymmetric volatility responses to positive versus negative shocks. These represent complementary dimensions of financial asymmetry that could enhance analysis through integration. The mathematical structure of our causal attention mechanism could potentially accommodate volatility-dependent transmission parameters, creating a comprehensive model addressing both transmission speed asymmetries and volatility response asymmetries. This integration would enable simultaneous modeling of mean transmission effects and volatility dynamics, providing deeper insights into how policy uncertainty affects both transmission mechanisms and market volatility patterns.

Future research directions encompass several mathematically rich areas where symmetry–asymmetry relationships remain unexplored. The extension to higher-dimensional symmetric groups and their asymmetric substructures could enable an analysis of complex multi-agent financial systems where symmetric interaction patterns coexist with asymmetric information flows. Investigation of time-varying symmetry breaking mechanisms could capture how symmetric market structures evolve into asymmetric configurations during crisis periods, providing mathematical frameworks for understanding phase transitions in financial systems. The development of quantum-inspired symmetric operations with asymmetric measurement constraints could enable an analysis of quantum finance models where symmetric superposition states collapse into asymmetric market realizations.

Author Contributions

Methodology, W.Z. and W.L.; Software, W.Z. and W.L.; Writing—original draft, W.Z.; Writing—review & editing, W.Z. and W.L.; Supervision, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Bernanke, B.S. 21st Century Monetary Policy: The Federal Reserve from the Great Inflation to COVID-19; WW Norton & Company: New York, NY, USA, 2022.
2. Yellen, J. The Goals of Monetary Policy and How We Pursue Them: A Speech at the Commonwealth Club, San Francisco, California, January 18, 2017; Technical report; Board of Governors of the Federal Reserve System (US): Washington, DC, USA, 2017.
3. Powell, J.H. New Economic Challenges and the Fed’s Monetary Policy Review: A Speech at “Navigating the Decade Ahead: Implications for Monetary Policy”, an Economic Policy Symposium Sponsored by the Federal Reserve Bank of Kansas City, Jackson Hole, Wyoming, 27 August 2020. Technical Report. 2020. Available online: https://www.federalreserve.gov/newsevents/speech/powell20200827a.htm (accessed on 27 August 2020).
4. Sims, C.A. Macroeconomics and reality. Econom. J. Econom. Soc. 1980, 48, 1–48.
5. Blanchard, O.J.; Quah, D. The dynamic effects of aggregate demand and supply disturbances. Am. Econ. Rev. 1989, 79, 655–673.
6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
7. Heaton, J.B.; Polson, N.G.; Witte, J.H. Deep learning for finance: Deep portfolios. Appl. Stoch. Model. Bus. Ind. 2017, 33, 3–12.
8. Pearl, J. Causality; Cambridge University Press: Cambridge, UK, 2009.
9. Rubin, D.B. Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 1974, 66, 688.
10. Sharma, A.; Kiciman, E. DoWhy: An end-to-end library for causal inference. arXiv 2020, arXiv:2011.04216.
11. Engle, R.F.; Granger, C.W. Co-integration and error correction: Representation, estimation, and testing. Econom. J. Econom. Soc. 1987, 55, 251–276.
12. White, H. Economic prediction using neural networks: The case of IBM daily stock returns. In Proceedings of the IEEE 1988 International Conference on Neural Networks (ICNN), San Diego, CA, USA, 24–27 July 1988; Volume 2, pp. 451–458.
13. Rather, A.M.; Agarwal, A.; Sastry, V.N. Recurrent neural network and a hybrid model for prediction of stock returns. Expert Syst. Appl. 2015, 42, 3234–3241.
14. Nelson, D.M.; Pereira, A.C.; De Oliveira, R.A. Stock market’s price movement prediction with LSTM neural networks. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1419–1426.
15. Sezer, O.B.; Gudelek, M.U.; Ozbayoglu, A.M. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Appl. Soft Comput. 2020, 90, 106181.
16. Feng, F.; He, X.; Wang, X.; Luo, C.; Liu, Y.; Chua, T.S. Temporal relational ranking for stock prediction. ACM Trans. Inf. Syst. (TOIS) 2019, 37, 27.
17. Feng, F.; Chen, H.; He, X.; Ding, J.; Sun, M.; Chua, T.S. Enhancing stock movement prediction with adversarial training. arXiv 2018, arXiv:1810.09936.
18. Wu, N.; Green, B.; Ben, X.; O’Banion, S. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv 2020, arXiv:2001.08317.
19. Wiese, M.; Knobloch, R.; Korn, R.; Kretschmer, P. Quant GANs: Deep generation of financial time series. Quant. Financ. 2020, 20, 1419–1440.
20. Eckerli, F.; Osterrieder, J. Generative adversarial networks in finance: An overview. arXiv 2021, arXiv:2106.06364.
21. Imbens, G.W.; Rubin, D.B. Causal Inference in Statistics, Social, and Biomedical Sciences; Cambridge University Press: Cambridge, UK, 2015.
22. Pearl, J. Causal diagrams for empirical research. Biometrika 1995, 82, 669–688.
23. van Amsterdam, W.; Elias, S.; Ranganath, R. Causal inference in oncology: Why, what, how and when. Clin. Oncol. 2025, 38, 103616.
24. Engler, J.O.; Beeck, J.J.; von Wehrden, H. Mostly harmless econometrics? Statistical paradigms in the ‘top five’ from 2000 to 2018. J. Econ. Methodol. 2025, 32, 14–32.
25. Spirtes, P.; Glymour, C.N.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2000.
26. Zhang, J. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artif. Intell. 2008, 172, 1873–1896.
27. Athey, S.; Imbens, G. Recursive partitioning for heterogeneous causal effects. Proc. Natl. Acad. Sci. USA 2016, 113, 7353–7360.
28. Wager, S.; Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 2018, 113, 1228–1242.
29. Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Hansen, C.; Newey, W.; Robins, J. Double/debiased machine learning for treatment and structural parameters. Econom. J. 2018, 21, C1–C68.
30. Angrist, J.D.; Pischke, J.S. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. J. Econ. Perspect. 2010, 24, 3–30.
31. Romer, C.D.; Romer, D.H. A new measure of monetary shocks: Derivation and implications. Am. Econ. Rev. 2004, 94, 1055–1084.
32. Gertler, M.; Karadi, P. Monetary policy surprises, credit costs, and economic activity. Am. Econ. J. Macroecon. 2015, 7, 44–76.
33. Nakamura, E.; Steinsson, J. High-frequency identification of monetary non-neutrality: The information effect. Q. J. Econ. 2018, 133, 1283–1330.
34. Christoffersen, S.E.; Geczy, C.C.; Musto, D.K.; Reed, A.V. Vote trading and information aggregation. J. Financ. 2007, 62, 2897–2929.
35. Bradley, D.J.; Jordan, B.D. Partial adjustment to public information and IPO underpricing. J. Financ. Quant. Anal. 2002, 37, 595–616.
36. Adelino, M.; Schoar, A.; Severino, F. Loan originations and defaults in the mortgage crisis: The role of the middle class. Rev. Financ. Stud. 2016, 29, 1635–1670.
37. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (long and short papers), pp. 4171–4186.
38. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
39. Yoo, J.; Soun, Y.; Park, Y.c.; Kang, U. Accurate multivariate stock movement prediction via data-axis transformer with multi-level contexts. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 2037–2045.
40. Sun, R.; Stefanidis, A.; Jiang, Z.; Su, J. Combining transformer based deep reinforcement learning with Black-Litterman model for portfolio optimization. Neural Comput. Appl. 2024, 36, 20111–20146.
41. Qin, Y.; Song, D.; Chen, H.; Cheng, W.; Jiang, G.; Cottrell, G. A dual-stage attention-based recurrent neural network for time series prediction. arXiv 2017, arXiv:1704.02971.
42. Shih, S.Y.; Sun, F.K.; Lee, H.y. Temporal pattern attention for multivariate time series forecasting. Mach. Learn. 2019, 108, 1421–1441.
43. Kim, K.j. Financial time series forecasting using support vector machines. Neurocomputing 2003, 55, 307–319.
44. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.
45. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768.
46. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297.
47. Nelson, C.R.; Siegel, A.F. Parsimonious modeling of yield curves. J. Bus. 1987, 60, 473–489.
48. Svensson, L.E. Estimating and interpreting forward interest rates: Sweden 1992–1994. Sveriges Riksbank Q. Rev. 1995, 3, 13–26.
49. Diebold, F.X.; Li, C. Forecasting the term structure of government bond yields. J. Econom. 2006, 130, 337–364.
50. Bauer, M.D.; Neely, C.J. International channels of the Fed’s unconventional monetary policy. J. Int. Money Financ. 2014, 44, 24–46.
51. Exterkate, P.; Groenen, P.J.; Heij, C.; van Dijk, D. Nonlinear forecasting with many predictors using kernel ridge regression. Int. J. Forecast. 2016, 32, 736–753.
52. Richman, R.; Scognamiglio, S. Multiple yield curve modeling and forecasting using deep learning. ASTIN Bull. J. IAA 2024, 54, 463–494.
53. Shanmugam, R. Elements of causal inference: Foundations and learning algorithms. J. Stat. Comput. Simul. 2018, 88, 3248.
54. Schölkopf, B. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl; Association for Computing Machinery: New York, NY, USA, 2022; pp. 765–804.
55. Wang, T.; Zhou, C.; Sun, Q.; Zhang, H. Causal attention for unbiased visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3091–3100.
56. Sui, Y.; Wang, X.; Wu, J.; Lin, M.; He, X.; Chua, T.S. Causal attention for interpretable and generalizable graph classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington DC, USA, 14–18 August 2022; pp. 1696–1705.
57. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764.
58. Goudet, O.; Kalainathan, D.; Caillou, P.; Guyon, I.; Lopez-Paz, D.; Sebag, M. Causal generative neural networks. arXiv 2017, arXiv:1711.08936.
59. Khemakhem, I.; Kingma, D.; Monti, R.; Hyvarinen, A. Variational autoencoders and nonlinear ica: A unifying framework. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, Virtual, 26–28 August 2020; pp. 2207–2217.
60. Vowels, M.J.; Camgoz, N.C.; Bowden, R. D’ya like dags? A survey on structure learning and causal discovery. ACM Comput. Surv. 2022, 55, 82.
61. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
62. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2019, arXiv:1905.10437.
63. Kalainathan, D.; Goudet, O.; Guyon, I.; Lopez-Paz, D.; Sebag, M. Sam: Structural agnostic model, causal discovery and penalized adversarial learning. arXiv 2018, arXiv:1803.04929.
64. Diebold, F.X.; Mariano, R.S. Comparing predictive accuracy. J. Bus. Econ. Stat. 2002, 20, 134–144.
65. Bernanke, B.S.; Gertler, M. Inside the Black Box: The Credit Channel of Monetary Policy Transmission. J. Econ. Perspect. 1995, 9, 27–48.
66. Coibion, O. Are the Effects of Monetary Policy Shocks Big or Small? Am. Econ. J. Macroecon. 2012, 4, 1–32.
67. Tenreyro, S.; Thwaites, G. Pushing on a String: US Monetary Policy Is Less Powerful in Recessions. Am. Econ. J. Macroecon. 2016, 8, 43–74.
68. Eggertsson, G.B.; Woodford, M. The Zero Bound on Interest Rates and Optimal Monetary Policy. Brookings Pap. Econ. Act. 2003, 2003, 139–235.
69. Mishkin, F.S. The Channels of Monetary Transmission: Lessons for Monetary Policy. In NBER Working Paper; No. 5464; National Bureau of Economic Research: Cambridge, MA, USA, 1996.

Figure 1. CausalFormer architecture overview. The framework processes financial time series through multiple transformer layers, each incorporating causal mechanisms in parallel computational streams. Within each transformer block, the multi-kernel causal convolution module and enhanced Nelson–Siegel decomposition layer operate concurrently on attended representations, generating separate feature streams that capture policy transmission effects and yield curve factor dynamics, respectively. These parallel streams are subsequently fused through learned attention weights before layer normalization and residual connections. The architecture integrates causal self-attention for temporal dependencies, enabling simultaneous processing of multi-scale temporal patterns and structural factor relationships while maintaining computational efficiency through specialized parallel processing paths.

Figure 2. The causal self-attention mechanism. The diagram illustrates how temporal priority constraints and the causal structure influence attention weight computation. The mechanism ensures that attention flows only from causally valid predecessors while incorporating learned causal relationships through adaptive weighting.
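
The temporal-priority constraint described in the Figure 2 caption can be illustrated with a minimal sketch. It covers only the masking step (each time step attends to itself and to earlier steps); the learned adaptive causal weighting is not reproduced. The function name `causal_attention_weights` and the example scores are illustrative assumptions, not taken from the paper.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax restricted to positions j <= i.

    scores[i][j] is a raw attention score from query step i to key step j.
    Future positions (j > i) receive a weight of exactly 0.
    """
    T = len(scores)
    weights = []
    for i in range(T):
        # Only steps 0..i are causally valid predecessors of step i.
        allowed = scores[i][: i + 1]
        m = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in allowed]
        z = sum(exps)
        # Pad the masked (future) positions with zeros.
        weights.append([e / z for e in exps] + [0.0] * (T - i - 1))
    return weights

# Hypothetical raw scores for a sequence of three time steps.
scores = [[0.2, 1.5, -0.3],
          [0.7, 0.1, 2.0],
          [1.0, 1.0, 1.0]]
W = causal_attention_weights(scores)
```

Each row of `W` sums to one over the allowed positions, and the strictly upper-triangular entries are zero, which is the "no future-looking attention" property the captions refer to.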

Figure 3. Attention weight patterns in CausalFormer reflecting financial theory. The heatmap shows the attention weights between financial variables, with darker colors indicating stronger weights (scale 0–1). The patterns reflect core financial theories: strong attention weights of 0.73–0.89 between the Fed Funds rate and long-term yields are in line with the expectations hypothesis, under which long-term rates incorporate expectations of future short-term rate movements. Cross-maturity attention varies with market regime, strengthening to 0.91 or higher during crisis periods, which reflects elevated term premium sensitivity to policy signals and is consistent with affine term structure models. Weak attention between distant maturities during normal periods (weights < 0.3) aligns with segmented market theory, while stronger cross-maturity attention during stress periods supports modifications of preferred habitat theory. Temporal precedence is strictly maintained, with no forward-looking attention weights, ensuring causal consistency.

Figure 4. Policy transmission effects across time horizons. The figure shows impulse response functions for different monetary policy instruments: conventional rate changes (blue), quantitative easing (red), and forward guidance (green). The x-axis represents days after the policy announcement, and the y-axis shows the cumulative effect on 10-year Treasury yields in basis points. Shaded areas represent the 95% confidence intervals. Notable features include the immediate response to rate changes, delayed but persistent QE effects, and an intermediate transmission speed for forward guidance.

Table 1. Descriptive statistics for key variables.

Variable	N	Mean	Std. Dev.	Min	Max	Skewness	Kurtosis
Fed Funds Rate (%)	6174	2.34	2.18	0.05	6.50	0.89	2.41
10Y Treasury (%)	6174	3.12	1.47	0.51	6.03	0.34	1.89
2Y Treasury (%)	6174	2.45	1.98	0.09	5.85	0.67	2.15
Credit Spread (bps)	6174	187.3	156.2	73.0	2201.0	3.45	18.7
VIX	6174	20.1	8.9	9.1	82.7	2.34	10.8
Policy Surprise (bps)	193	0.2	8.4	−43.0	37.5	−0.12	4.67

Table 2. Yield curve model comparison.

Model	RMSE	Interpretability	Causal Integration	Stability
Cubic Splines	0.0234	Low	Poor	Medium
PCA (3 factors)	0.0298	Medium	Poor	High
Affine Term Structure	0.0267	High	Medium	Low
Nelson–Siegel	0.0241	High	High	High
Enhanced NS	0.0198	High	High	High
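
For orientation, the standard three-factor Nelson–Siegel curve behind the last two rows of Table 2 can be sketched as below. This is the textbook form with level, slope, and curvature loadings, not the enhanced variant evaluated in the table; the function name and all parameter values are illustrative assumptions.

```python
import math

def nelson_siegel_yield(tau, beta0, beta1, beta2, lam):
    """Standard Nelson-Siegel yield at maturity tau (tau > 0).

    beta0: level factor, beta1: slope factor, beta2: curvature factor,
    lam: decay rate controlling where the curvature loading peaks.
    """
    x = lam * tau
    slope_loading = (1.0 - math.exp(-x)) / x
    curvature_loading = slope_loading - math.exp(-x)
    return beta0 + beta1 * slope_loading + beta2 * curvature_loading

# Hypothetical factor values (in percent) and decay rate.
short_end = nelson_siegel_yield(1e-4, beta0=4.0, beta1=-2.0, beta2=1.0, lam=0.5)
long_end = nelson_siegel_yield(1e6, beta0=4.0, beta1=-2.0, beta2=1.0, lam=0.5)
```

As maturity goes to zero the yield approaches `beta0 + beta1`, and as maturity grows it approaches `beta0`, which is why the three factors are usually read as level, slope, and curvature.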

Table 3. Forecasting performance comparison with statistical significance.

Method	1-Day MAE	1-Week MAE	1-Month MAE	1-Quarter MAE	Directional Acc.
VAR	0.0847	0.1523	0.2341	0.3456	64.2%
LSTM	0.0734	0.1298	0.2087	0.3201	67.8%
Transformer	0.0678	0.1156	0.1893	0.2934	69.5%
TFT	0.0651	0.1134	0.1867	0.2891	71.2%
N-BEATS	0.0698	0.1203	0.1924	0.2987	68.9%
CausalFormer	0.0589 ***	0.0991 ***	0.1630 ***	0.2524 ***	79.5% ***

*** indicates statistical significance at the 1% level (Diebold–Mariano test). All improvements are also significant in the Model Confidence Set at the 95% confidence level.
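
The Diebold–Mariano comparison behind the significance markers in Table 3 can be sketched for the simplest case. The sketch assumes one-step-ahead forecasts, absolute-error loss, and a normal approximation; the loss function used in the paper is not stated here, and multi-step horizons would additionally require a long-run (HAC) variance estimate. The function name and the error series are hypothetical.

```python
import math

def diebold_mariano(errors_a, errors_b):
    """DM statistic and two-sided p-value for one-step-ahead forecasts.

    A positive statistic means forecast A has the larger average
    absolute error, i.e. forecast B is more accurate.
    """
    # Loss differential under absolute-error loss.
    d = [abs(a) - abs(b) for a, b in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n
    # Lag-0 autocovariance only; sufficient for one-step horizons.
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    dm = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(dm) / math.sqrt(2.0))
    return dm, p_value

# Hypothetical forecast errors for a baseline (A) and a competitor (B).
errors_a = [0.9, -1.1, 1.2, -0.8, 1.0, -1.3]
errors_b = [0.5, -0.6, 0.4, -0.7, 0.5, -0.6]
dm, p_value = diebold_mariano(errors_a, errors_b)
```

With only six observations the normal approximation is rough; the example is meant to show the mechanics of the statistic, not a realistic sample size.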

Table 4. Causal effect estimation performance.

Method	ATE Bias	RMSE-TE	Coverage	CI Width
DML (RF)	0.0234	0.0456	89.4%	0.1832
DML (GB)	0.0221	0.0443	90.7%	0.1798
SAM	0.0267	0.0498	87.3%	0.1956
PC + Regression	0.0298	0.0523	85.1%	0.2134
CausalFormer	0.0187	0.0381	93.2%	0.1567

Table 5. Causal structure learning performance comparison.

Method	SHD	Precision	Recall	F1-Score	AUROC
PC Algorithm	4.7	73.2%	68.9%	71.0%	0.845
SAM	3.9	78.1%	74.3%	76.2%	0.867
GES	4.2	75.6%	71.8%	73.6%	0.856
NOTEARS	3.6	79.8%	76.2%	78.0%	0.883
CausalFormer	2.3	87.4%	82.1%	84.7%	0.923
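
The structure-learning metrics in Table 5 can be computed from adjacency matrices as in the sketch below. It assumes the common convention in which a reversed edge counts once toward the structural Hamming distance (SHD); the paper's exact convention is not given here, and the function names and the example graphs are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

def structure_metrics(true_adj, pred_adj):
    """Return (shd, precision, recall, f1) for two directed graphs.

    adj[i][j] == 1 means an edge i -> j. SHD counts node pairs whose
    edge status differs, so a reversed edge contributes 1.
    """
    n = len(true_adj)
    tp = fp = fn = shd = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            t, p = true_adj[i][j], pred_adj[i][j]
            tp += 1 if (t and p) else 0
            fp += 1 if (p and not t) else 0
            fn += 1 if (t and not p) else 0
            # Compare each unordered pair once for the SHD.
            if i < j and (t, true_adj[j][i]) != (p, pred_adj[j][i]):
                shd += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return shd, precision, recall, f1_score(precision, recall)

# Hypothetical ground truth: 0->1, 1->2, 0->3.
true_adj = [[0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
# Hypothetical estimate: 0->1, 2->1 (reversed), 0->3, 2->3 (extra).
pred_adj = [[0, 1, 0, 1], [0, 0, 0, 0], [0, 1, 0, 1], [0, 0, 0, 0]]
shd, precision, recall, f1 = structure_metrics(true_adj, pred_adj)
```

Applying `f1_score` to the CausalFormer row of Table 5 (precision 87.4%, recall 82.1%) gives about 84.7%, matching the reported F1-Score.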

Table 6. Parameter sensitivity analysis.

Parameter	Optimal Value	Sensitivity Range	Performance Drop
$\lambda_{\mathrm{causal}}$	1.0	[0.5, 2.0]	18.4% (at 0.1)
NS Embedding Dim	256	[128, 512]	12.3% (at 64)
Dilation Rates	[1, 2, 4, 8, 16]	Geometric	8.7% (linear)
Learning Rate	$10^{-4}$	[$5 \times 10^{-5}$, $2 \times 10^{-4}$]	6.2% (at $10^{-3}$)
Attention Heads	8	[4, 16]	4.9% (at 2)

Table 7. Robustness analysis results.

Condition	MAE Change	Causal Consistency	F1-Score
Baseline	0.0%	91.2%	84.7%
$\lambda_{\mathrm{causal}} = 0.5$	+2.1%	89.8%	83.2%
$\lambda_{\mathrm{causal}} = 2.0$	+1.7%	92.4%	85.1%
50% Missing Data	+8.3%	87.6%	81.9%
Crisis Period Only	+5.4%	89.1%	82.7%
Small Sample (n = 1000)	+12.6%	85.3%	79.4%

Table 8. Computational complexity comparison.

Method	Training Time	Inference Time	Memory (GB)	Parameters (M)	Complexity
VAR	0.3 h	2 ms	0.1	0.05	$O(T d^{2})$
LSTM	1.2 h	15 ms	2.1	8.4	$O(T d^{2})$
Transformer	2.1 h	20 ms	3.8	12.7	$O(T^{2} d)$
TFT	3.4 h	18 ms	4.2	15.3	$O(T^{2} d)$
CausalFormer	4.7 h	12 ms	4.9	18.6	$O(T^{2} d + T d^{2})$

Benchmark: T = 2000, d = 50, NVIDIA V100 GPU.
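
To see which term of the CausalFormer complexity in Table 8 dominates at the benchmark dimensions, the two terms can be evaluated directly. This is plain arithmetic on the stated T and d; constant factors hidden by the O-notation are ignored, and the variable names are illustrative.

```python
# Benchmark dimensions reported under Table 8.
T, d = 2000, 50

quadratic_in_T = T**2 * d   # the T^2 d term
quadratic_in_d = T * d**2   # the T d^2 term

# The ratio of the two terms simplifies to T / d.
ratio = quadratic_in_T / quadratic_in_d
```

At these dimensions the T²d term is 200,000,000 and the Td² term is 5,000,000, so the term quadratic in sequence length is 40 times larger and drives the asymptotic cost.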

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zheng, W.; Liu, W. Symmetry-Aware Transformers for Asymmetric Causal Discovery in Financial Time Series. Symmetry 2025, 17, 1591. https://doi.org/10.3390/sym17101591

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
