1. Introduction
Forecasting volatility and managing tail risk across a multi-asset universe is one of the central problems in quantitative finance. Classical univariate models such as GARCH [1] and HAR [2] deliver parsimonious point forecasts of conditional variance, while multivariate extensions such as DCC–GARCH [3] and the Diebold–Yilmaz spillover index [4] provide useful descriptive measures of cross-asset connectedness. These models, however, rely on linear, contemporaneous covariance structures, which cannot capture the directional, lagged, and nonlinear information flow that characterises modern financial markets and that empirical studies of crisis transmission have repeatedly documented [5].
From an information-theoretic viewpoint, the natural object that summarises directional dependence between two stochastic processes is the transfer entropy of Schreiber [6], which generalises Granger causality to nonlinear and non-Gaussian settings. Empirical applications to financial time series have shown that transfer entropy detects regime changes and contagion episodes that linear measures miss [7]. Estimating transfer entropy from finite samples is nontrivial; the nearest-neighbour estimator of Kraskov, Stögbauer, and Grassberger [8] provides a bias-corrected plug-in estimator with favourable finite-sample properties, but to the best of our knowledge it has not been combined with modern graph deep learning in a unified probabilistic framework that delivers uncertainty quantification and risk-aware portfolio rules.
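To make the directional quantity concrete, the sketch below estimates lag-1 transfer entropy as the conditional mutual information between the next value of the target and the current value of the source, given the current value of the target. It is a minimal illustration under stated assumptions: quantile binning with plug-in entropies stands in for the k-nearest-neighbour estimator of [8], and the function name, the bin count, and the simulated series are hypothetical choices, not part of the framework described here.

```python
import numpy as np


def transfer_entropy_binned(source, target, n_bins=4):
    """Plug-in estimate (nats) of lag-1 transfer entropy source -> target.

    Assumption: quantile binning stands in for the k-NN (KSG) estimator.
    """
    def discretise(x):
        x = np.asarray(x, dtype=float)
        # Interior quantile edges give roughly equal-count bins.
        edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
        return np.digitize(x, edges)

    def joint_entropy(*columns):
        # Shannon entropy of the empirical joint distribution of the columns.
        table = np.stack(columns, axis=1)
        _, counts = np.unique(table, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    src, tgt = discretise(source), discretise(target)
    src_past, tgt_past, tgt_next = src[:-1], tgt[:-1], tgt[1:]

    # I(tgt_next; src_past | tgt_past) written as a sum of joint entropies.
    return (joint_entropy(tgt_next, tgt_past)
            + joint_entropy(tgt_past, src_past)
            - joint_entropy(tgt_past)
            - joint_entropy(tgt_next, tgt_past, src_past))


# Simulated example: y drives x with a one-step lag, but not the reverse.
rng = np.random.default_rng(0)
y = rng.standard_normal(5000)
x = np.empty_like(y)
x[0] = 0.0
x[1:] = 0.8 * y[:-1] + 0.3 * rng.standard_normal(4999)

te_y_to_x = transfer_entropy_binned(y, x)
te_x_to_y = transfer_entropy_binned(x, y)
print(f"TE(y -> x) = {te_y_to_x:.3f} nats, TE(x -> y) = {te_x_to_y:.3f} nats")
```

The asymmetry between the two directions is what a symmetric correlation coefficient cannot express. The binned estimator carries a positive finite-sample bias that grows with the number of bins, which is one reason a bias-corrected nearest-neighbour estimator is preferable in practice.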
In parallel, graph neural networks have emerged as a powerful tool for relational learning [9, 10, 11], with attention-based variants such as the Graphormer [12] and the graph transformer [13] matching or surpassing convolutional counterparts on a range of benchmarks. In finance, recent applications include stock prediction with temporal relational graphs [14, 15] and relational learning for credit risk [16]. These efforts typically build the graph from precomputed correlations or sector membership, losing the directional information that transfer entropy would supply.
A second methodological pillar is the variational information bottleneck (VIB) of Alemi et al. [17], which extends the information bottleneck principle to deep networks by minimising the mutual information I(X; Z) between input X and latent Z subject to a constraint on the predictive information, and which provides a tractable variational upper bound that has been linked to generalisation in deep networks [18, 19]. Despite the active literature on information-theoretic representation learning [20], the combination of a VIB-regularised graph encoder with transfer entropy edges has not, to our knowledge, been studied in the context of multi-asset volatility forecasting and portfolio construction.
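The tractable upper bound mentioned above rests on a closed-form KL divergence between the encoder posterior and a fixed prior; averaged over inputs, that KL upper-bounds I(X; Z). The following sketch evaluates it for a diagonal Gaussian encoder. The standard-normal prior, the function name, and the example encoder outputs are assumptions made for illustration; a structured prior would require a different expression.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, per sample.

    Assumption: a standard-normal prior; a structured prior would change
    the formula.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Closed form for diagonal Gaussians, summed over latent dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


# Hypothetical encoder outputs for three inputs and a 2-dimensional latent.
mu = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, -0.5]])
log_var = np.array([[0.0, 0.0], [0.0, 0.0], [-1.0, -1.0]])
kl = kl_to_standard_normal(mu, log_var)
print(kl)  # the first entry is 0 because that posterior equals the prior
```

Because the penalty is available in closed form, it can be added to a training loss without any sampling beyond the reparameterised draw used for the predictive term.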
A third pillar is risk-aware portfolio construction. Conditional Value at Risk (CVaR) provides a coherent risk measure that admits a convex reformulation [21], and entropy-based portfolio construction [22] has shown that information-theoretic objectives can improve diversification beyond mean–variance. We unify these ideas: the predictive entropy from a calibrated forecaster acts as a position scaling signal, while the Kullback–Leibler divergence between the bottleneck posterior and a structured prior measures model risk and enters a CVaR-constrained second-order cone programme.
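The convex reformulation of [21] expresses CVaR at level alpha as the minimum over a threshold tau of tau + E[(L - tau)^+] / (1 - alpha), where L is the loss. The sketch below evaluates this on a sample of scenario losses. It is a scenario-based illustration only: the function name and the toy losses are hypothetical, and the second-order cone form used with a Gaussian predictive distribution is not reproduced here.

```python
import numpy as np


def cvar_rockafellar_uryasev(losses, alpha=0.95):
    """Sample CVaR via min over tau of tau + E[(L - tau)^+] / (1 - alpha).

    Returns (cvar, tau_star), where tau_star is a minimising threshold
    (a sample value-at-risk).
    """
    losses = np.asarray(losses, dtype=float)
    # The objective is convex and piecewise linear in tau with kinks at the
    # sample points, so it suffices to evaluate it at each observed loss.
    candidates = np.sort(losses)
    excess = np.maximum(losses[None, :] - candidates[:, None], 0.0)
    objective = candidates + excess.mean(axis=1) / (1.0 - alpha)
    best = int(np.argmin(objective))
    return objective[best], candidates[best]


# Toy scenario losses 1, 2, ..., 100: the worst 5 percent average to 98.
losses = np.arange(1.0, 101.0)
cvar, var = cvar_rockafellar_uryasev(losses, alpha=0.95)
print(cvar, var)
```

In a portfolio setting the losses would be the negated scenario returns of the weighted portfolio. The objective is jointly convex in the weights and the threshold, which is what allows a CVaR limit to be imposed as a convex constraint.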
Three desiderata follow from this discussion. The model should build the inter-asset graph from directional, time-varying information measures rather than from sample correlations; the encoder should compress the input to a minimal sufficient representation, so that overfitting is controlled and the predictive distribution is well calibrated; and the resulting predictive distribution should be operationalised into a portfolio rule that respects explicit tail-risk constraints. The proposed TDV framework satisfies all three.
The primary contribution of this paper is the TDV forecasting model, which captures directional, nonlinear volatility spillovers with calibrated uncertainty through the transfer entropy graph, the graph attention encoder, and the variational information bottleneck (Section 2.2, Section 2.3 and Section 2.4). To demonstrate the practical value of the calibrated predictive distribution, we present a downstream portfolio application in which the uncertainty estimates are consumed by an entropy-regulated, CVaR-constrained allocation rule (Section 4). The CVaR-constrained programme itself employs the standard convex formulation of Rockafellar and Uryasev [21]; the contribution of Section 4 lies not in the optimisation technique but in the information-theoretic inputs (the predictive entropy penalty and the KL-based model risk measure) that the TDV forecaster supplies. The theoretical results (Theorems 1–3) cover both the forecasting model and its downstream application, confirming that calibrated uncertainty propagates correctly into tail-risk control.
Information-theoretic financial econometrics is by now a mature field [7, 22], yet contemporary deep learning systems rarely treat differential entropy and KL divergence as first-class operational signals. The present paper closes that gap by coupling a graph attention forecaster with a VIB regulariser and by deriving theoretical guarantees that justify the use of the predictive entropy in the downstream allocation step.
Turning to deep learning for finance, the survey of [23] catalogued several hundred neural network applications to forecasting and trading; empirical studies by Gu, Kelly, and Xiu [24] and Fischer and Krauss [25] demonstrated that machine learning materially outperforms classical models for the cross section of returns and that deep hedging [26, 27] extends the methodology to derivative risk management. These approaches enlarge the function class but typically output point estimates without principled uncertainty quantification. Bayesian neural networks [28], Monte Carlo dropout [29], and deep ensembles [30] approximate posteriors but typically lack coverage guarantees. The VIB layer adopted here provides closed-form predictive entropies under the Gaussian encoder, sidestepping the sampling cost while retaining a principled information-theoretic interpretation of regularisation.
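The closed form referred to here is the standard differential entropy of a multivariate Gaussian, 0.5 log((2 pi e)^d det(Sigma)), which needs no Monte Carlo draws. A minimal sketch follows; the function name and the example covariances are hypothetical, and the predictive covariance is assumed to be supplied by the encoder.

```python
import numpy as np


def gaussian_entropy(cov):
    """Differential entropy (nats) of N(mean, cov); the mean plays no role."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    d = cov.shape[0]
    # slogdet is numerically safer than log(det(cov)).
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)


h_unit = gaussian_entropy([[1.0]])
h_wide = gaussian_entropy([[4.0]])
print(h_unit, h_wide - h_unit)
```

Quadrupling the variance adds log 2 nats of entropy, so the quantity grows with the logarithm of the predictive scale and is cheap to evaluate at every forecast step.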
Table 1 positions the proposed framework against representative prior studies along seven axes, including the theoretical guarantees that distinguish our contribution.
Table 1 shows that no single existing approach addresses all seven dimensions jointly. The contributions of this paper close that gap. The scope of the claimed contribution is deliberately circumscribed: transfer entropy estimation, graph attention, the variational information bottleneck, and CVaR optimisation are individually established methodologies, and the portfolio programme follows the standard convex formulation of Rockafellar and Uryasev [21]. The contribution lies in the integrated architecture, in the structural by-products that the integration uniquely enables (the entropy decomposition, the KL-based model risk measure, and the entropy-penalised allocation rule), and in the three theoretical results (Theorems 1–3) that certify the end-to-end pipeline. The three structural interactions that the integration creates are detailed below.
The contributions form a connected progression. The novelty resides not in the individual components but in three structural interactions that the integrated architecture creates and that no proper subset of the ingredients can reproduce. At the graph level, the injection of transfer entropy weights into the attention logits (Equation (9), via a learnable mixing coefficient) endows the graph attention mechanism with a directional, nonlinear information-theoretic prior on the message passing geometry; this prior is absent from standard graph attention networks that learn attention from node features alone, and it cannot be recovered by post hoc thresholding of a learned attention matrix, because the transfer entropy signal shapes the gradient landscape during training rather than merely filtering the output. At the representation level, the variational information bottleneck serves a dual role that is unique to this architecture: it simultaneously regularises the encoder through the PAC–Bayes bound (Theorem 2), whose complexity term is the very KL penalty minimised during training, and it produces a closed-form Gaussian predictive distribution whose differential entropy decomposes into aleatoric and epistemic components (Proposition 2); this dual role is specific to the Gaussian bottleneck placed after the graph attention layers and does not arise in generic Bayesian neural networks, Monte Carlo dropout, or deep ensembles, all of which require sampling-based entropy estimates without closed-form decompositions. At the allocation level, the predictive entropy and the inter-specification KL divergence furnish information-theoretic inputs to the CVaR-constrained allocation that conventional plug-in variance estimates cannot supply: the entropy penalty (Equation (23)) scales each position by a measure of the model’s own calibrated uncertainty rather than by a point-estimate risk premium, and the CVaR feasibility guarantee (Theorem 3) ensures that the encoder’s calibration propagates to the tail-risk constraint, closing a theoretical loop between representation learning and portfolio feasibility that remains open when the two stages are treated as independent modules. Specifically, the proposed TDV framework:
1. Builds a directed, time-varying graph whose edge weights are bias-corrected k-nearest-neighbour estimates of pairwise transfer entropy on a rolling window, supplying directional and nonlinear information that Pearson and Spearman correlation networks miss (Section 2.2).
2. Learns node embeddings through a multi-head graph attention encoder whose attention scores are augmented by the transfer entropy weights, so that the information flow estimated in the data shapes the message passing geometry (Section 2.3).
3. Compresses the resulting representation through a variational information bottleneck layer that minimises the mutual information between input and latent representation while maximising the predictive mutual information, with a closed-form Gaussian posterior that yields tractable predictive entropy and KL divergence (Section 2.4).
4. Establishes three theoretical results: consistency and a convergence rate for the k-nearest-neighbour transfer entropy estimator under β-mixing (Theorem 1), a PAC–Bayes generalisation bound for the bottleneck-encoded graph attention forecaster (Theorem 2), and asymptotic feasibility of the CVaR-constrained allocation under the calibrated predictive distribution (Theorem 3). Each result addresses a gap in the existing literature that is not closed by the constituent techniques alone. Theorem 1 extends the KSG consistency proof from i.i.d. samples to β-mixing processes via Berbee’s coupling construction, yielding a convergence rate tailored to the serial dependence structure of financial returns. Theorem 2 derives a PAC–Bayes bound whose complexity term is the sample-averaged KL divergence of a graph-structured VIB encoder, linking the information bottleneck penalty directly to the generalisation gap in a non-i.i.d. setting. Theorem 3 completes the chain by proving that the convergence of the encoder’s predictive moments propagates through the CVaR functional to guarantee asymptotic constraint satisfaction, a result that requires the joint analysis of the encoder, the shrinkage estimator, and the portfolio optimiser and does not follow from any one component in isolation.
5. Demonstrates the practical utility of the calibrated predictions through a downstream portfolio application, in which the predictive differential entropy modulates each position via an uncertainty-aware exponential penalty, the KL divergence between the bottleneck posterior and a structured prior measures model risk, and a standard CVaR constraint is enforced as a second-order cone programme (Section 4).
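The second item in the list above, attention scores augmented by transfer entropy weights, can be sketched as an additive bias on the attention logits before the softmax. This is an illustrative guess at the mechanism, not a reproduction of Equation (9): the additive form, the function name, the fixed value of the mixing coefficient `gamma`, and the example matrices are all assumptions.

```python
import numpy as np


def te_biased_attention(scores, te, gamma):
    """Row-wise softmax of feature logits plus a transfer entropy bias.

    scores[i, j]: feature-based logit for node i attending to node j.
    te[i, j]:     estimated transfer entropy from asset j to asset i.
    gamma:        mixing coefficient (learnable in the paper, fixed here).
    Assumption: an additive bias; Equation (9) may differ in detail.
    """
    logits = np.asarray(scores, dtype=float) + gamma * np.asarray(te, dtype=float)
    # Subtract the row maximum for numerical stability of the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Three assets with uninformative feature logits; asset 2 sends strong
# information to asset 0 according to the (hypothetical) TE matrix.
scores = np.zeros((3, 3))
te = np.array([[0.0, 0.1, 0.9],
               [0.2, 0.0, 0.1],
               [0.0, 0.0, 0.0]])
plain = te_biased_attention(scores, te, gamma=0.0)
biased = te_biased_attention(scores, te, gamma=2.0)
print(biased[0])
```

With `gamma` set to zero the mechanism falls back to ordinary feature-based attention, so the mixing coefficient controls how strongly the estimated information flow reshapes message passing.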
The integrative point is that combining transfer entropy edges, graph attention, variational information bottleneck, and CVaR optimisation within one semiparametric model yields structural by-products, namely an entropy decomposition of forecast uncertainty, KL bounds on cross-specification divergence, and an entropy-penalised allocation rule, that are simply not available when the ingredients are deployed separately. Only the full pipeline delivers all three: interpretable directional attention, closed-form entropy with an aleatoric–epistemic split, and a provable CVaR feasibility certificate, as established by the three structural interactions described above. The ablation study supplies direct empirical support for this claim: removing transfer entropy edges in favour of Pearson correlation raises MSFE by 33 percent, removing the VIB layer widens the gap between empirical and nominal coverage to 11 percentage points, and replacing the entropy modulation with a flat momentum signal reduces the Sharpe ratio by 0.23. Each degradation is attributable to exactly one missing link in the pipeline and cannot be compensated by the remaining components, confirming that the three structural interactions are individually necessary.
The experimental design tests the integrative claim. We report simulation studies under three canonical data generating processes (sparse Granger networks, contagion DCC–GARCH ensembles, and regime-switching factor models), adversarial misspecification studies, finite-sample calibration diagnostics, sub-period robustness on a global multi-asset panel, baseline hyperparameter tuning protocols, and a transaction cost sensitivity analysis. The proposed framework achieved the lowest mean squared forecasting error in every scenario, attained 94.2 percent empirical coverage of nominal 95 percent prediction intervals, and delivered a 1.46 annualised Sharpe ratio on the real data panel, with CVaR reductions of 28 to 36 percent over a minimum-variance benchmark and 22 to 28 percent over a vanilla graph attention baseline.
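The coverage figure quoted above is the share of realised values that fall inside the nominal prediction interval. A minimal sketch of that diagnostic for Gaussian predictive distributions follows; the function name and the simulated data are hypothetical and serve only to show how calibrated and overconfident forecasts differ.

```python
from statistics import NormalDist

import numpy as np


def empirical_coverage(y, mean, std, nominal=0.95):
    """Share of observations inside the central Gaussian prediction interval."""
    z = NormalDist().inv_cdf(0.5 + nominal / 2.0)
    inside = np.abs(np.asarray(y) - np.asarray(mean)) <= z * np.asarray(std)
    return float(inside.mean())


# Simulated forecasts: the data are drawn from the stated predictive
# distribution, so the first check should land close to the nominal level.
rng = np.random.default_rng(1)
mean = rng.normal(size=20000)
std = rng.uniform(0.5, 2.0, size=20000)
y = mean + std * rng.standard_normal(20000)

calibrated = empirical_coverage(y, mean, std)
overconfident = empirical_coverage(y, mean, 0.5 * std)
print(calibrated, overconfident)
```

A forecaster that understates its predictive standard deviation shows up immediately as coverage well below the nominal level, which is the kind of gap the ablation without the VIB layer reports.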
The remainder of the paper proceeds as follows. Section 2 constructs the TDV framework, covering the transfer entropy graph construction, the graph attention encoder, the variational information bottleneck, the joint training objective, and the supporting theoretical results. Section 3 develops the information-theoretic analysis, including the predictive entropy decomposition and its link to spillover and tail risk. Section 4 presents a downstream portfolio application based on entropy-regulated, CVaR-constrained allocation that demonstrates the practical value of the calibrated predictions. Section 5 presents the simulation and real data experiments. Section 6 discusses implications and limitations, and Section 7 concludes.
6. Discussion
The proposed framework brings together three methodological streams. Deep graph learning approaches yield expressive nonlinear forecasters but rely on correlation or pre-specified graphs [9, 10]; TDV upgrades the graph to a directional, time-varying transfer entropy network without sacrificing the attention-based inductive bias of modern GNNs. Information-theoretic representation learning techniques such as the variational information bottleneck [17] deliver calibrated representations and generalisation guarantees [18, 19], which TDV uses to bound the generalisation gap of the forecaster. Tail risk-aware portfolio construction [21] reaches its full potential when the predictive distribution is reliable; the downstream portfolio application in Section 4 feeds the calibrated Gaussian predictive from TDV into a standard CVaR-constrained second-order cone programme, demonstrating that the information-theoretic inputs materially improve tail-risk control relative to conventional plug-in estimates.
Adopting differential entropy, mutual information, and Kullback–Leibler divergence as operational signals departs from variance-based measures and offers three benefits. The entropy decomposition (Proposition 2) separates aleatoric and epistemic uncertainty in information-theoretic units, enabling cross-asset comparisons. The mutual information bound (Proposition 3) provides a hard ceiling on spillover prediction quality that allows us to benchmark how much of the available information the encoder actually extracts. The connection to the PAC–Bayes bound (Theorem 2) supplies a generalisation guarantee that variance-based methods lack, extending entropy-based financial econometrics [22] from a descriptive to a prescriptive register.
The framework crosses the boundaries of deep learning, graph neural networks, and information theory. The principal contribution is the TDV forecasting model itself; the portfolio application in Section 4 serves as a downstream demonstration that the calibrated predictive entropy can be used as an operational position sizing signal within a standard CVaR-constrained programme. The theoretical results are deliberately focused on the three operations that justify the entropy-regulated allocation (transfer entropy consistency, encoder generalisation, and CVaR feasibility), and they provide targeted guarantees that would not be available if the components were used in isolation. Beyond the empirical gains, the integrated architecture introduces three structural interactions (transfer entropy attention modulation, closed-form VIB entropy decomposition, and the Theorems 1–3 certificate chain) that are unavailable when the components are deployed in isolation; the detailed argument is given in Section 1 and the ablation evidence in Table 10.
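The use of predictive entropy as a position sizing signal can be illustrated with a simple exponential penalty, in which each raw position is multiplied by exp(-lambda h) and h is the differential entropy of that asset's Gaussian predictive distribution. The sketch is a hypothetical form consistent with the description of an exponential penalty; it is not Equation (23), and the function name, the penalty strength `lam`, and the example inputs are assumptions.

```python
import numpy as np


def entropy_scaled_positions(signal, pred_var, lam=1.0):
    """Scale raw positions by exp(-lam * h), h = Gaussian predictive entropy.

    Assumption: a univariate Gaussian predictive per asset and a
    hypothetical penalty strength lam; Equation (23) may differ.
    """
    signal = np.asarray(signal, dtype=float)
    pred_var = np.asarray(pred_var, dtype=float)
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * pred_var)
    return signal * np.exp(-lam * entropy)


# Two long signals of equal strength with different predictive variances,
# and one short signal.
signal = np.array([1.0, 1.0, -1.0])
pred_var = np.array([1.0, 4.0, 1.0])
positions = entropy_scaled_positions(signal, pred_var, lam=1.0)
print(positions)
```

With `lam` equal to one the factor reduces to 1 / (sigma * sqrt(2 pi e)), so this particular choice amounts to inverse-volatility scaling up to a constant; other values of `lam` make the penalty more or less aggressive.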
Several limitations warrant attention. The consistency result (Theorem 1) presumes β-mixing of the joint return process; while this holds for standard GARCH and stochastic volatility models, it may fail under structural breaks not captured by the assumed regimes. To assess whether the condition was empirically plausible on the real-data panel, we estimated the β-mixing coefficients from the sample autocorrelation function of absolute returns: the estimated coefficients decayed geometrically with a half-life of approximately 12 trading days across all 32 assets, and the summability condition was satisfied in all cases. During the three sub-periods, namely the COVID-19 shock (March 2020), the Ukraine conflict (February 2022), and the 2022–2023 rate-tightening cycle, the estimated mixing coefficients increased by a factor of 2–3 but remained summable, suggesting that the β-mixing framework accommodated these episodes as transient deviations rather than permanent structural breaks. A formal test of mixing under regime change, for instance via the adaptive block bootstrap of Politis and Romano, would strengthen this evidence and is left for future work.
The generalisation bound (Theorem 2) depends on the time-averaged KL divergence, which upper-bounds the true mutual information between input and latent representation and inherits the looseness of the Gaussian variational approximation. In practice, the gap between the variational KL and a MINE-based point estimate of this mutual information ranged from 0.8 to 1.5 nats across the three simulated DGPs (Figure 3), indicating that the bound was informative but not tight. Tightening the bound by adopting a more expressive variational family (e.g., normalising flows) or by computing the exact rate–distortion function is an interesting theoretical direction. The CVaR feasibility result (Theorem 3) requires sub-Gaussianity of the realised portfolio return; as discussed in Remark 5, the Cornish–Fisher correction adjusts the implemented quantile for skewness and kurtosis, and the diversification induced by the box constraints reduces the portfolio tail index to a sub-exponential regime, so that the feasibility conclusion extends in practice beyond the stated sufficient condition. Nevertheless, a formal extension to sub-exponential or stable laws would further align with heavy-tail empirical evidence.
In addition, we currently rely on independent encoders across assets; a multi-output graph attention version would exploit cross-asset correlation in the latent space and is a natural extension. The empirical universe of 32 assets does not capture intraday microstructure; higher-frequency extensions to limit order book data would benefit from the Mamba state space machinery [39] in place of the Transformer encoder. The entropy-regulated allocation could be amortised through an end-to-end deep portfolio policy in the style of [26, 40], which would replace the explicit SOCP with a learned actor; the resulting policy would lose some of the convex guarantees but might benefit from richer state information. Transfer entropy at fixed lag could be replaced by a spectral counterpart [5], capturing frequency-specific spillover that is informative for both short-term and long-term investors.
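The Cornish–Fisher correction mentioned in connection with Theorem 3 replaces the Gaussian quantile with one adjusted for skewness and excess kurtosis. The sketch below uses the textbook fourth-moment expansion; the exact variant in Remark 5 is not shown in this section, so the function name, the moment values, and the choice of expansion are assumptions.

```python
from statistics import NormalDist


def cornish_fisher_quantile(p, skew=0.0, excess_kurt=0.0):
    """Fourth-moment Cornish-Fisher approximation to a standardised quantile."""
    z = NormalDist().inv_cdf(p)
    return (z
            + (z**2 - 1.0) * skew / 6.0
            + (z**3 - 3.0 * z) * excess_kurt / 24.0
            - (2.0 * z**3 - 5.0 * z) * skew**2 / 36.0)


# Hypothetical moments: negative skew and fat tails on standardised returns.
z_gauss = cornish_fisher_quantile(0.05)
z_adj = cornish_fisher_quantile(0.05, skew=-0.5, excess_kurt=1.0)
print(z_gauss, z_adj)
```

With negative skewness the adjusted 5 percent quantile lies further in the left tail than the Gaussian one, so a tail-risk constraint built on it is more conservative. The expansion is only reliable for moderate departures from normality.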