1. Introduction
Real Estate Investment Trusts (REITs) are widely used in investment portfolios because they allow investors to gain exposure to real estate without giving up the liquidity and regulatory protections found in equity markets (Danila, 2025; Glickman, 2014; Gogineni et al., 2024). In practice, however, REIT returns are not easy to anticipate. They respond to many macro-financial conditions at once, including movements in interest rates, changes in global sentiment, shifts in commodity prices, and the way risk spreads across markets (Asadov et al., 2025; Fasanya & Adekoya, 2022; M. C. Wu & Wang, 2024). These influences tend to overlap and change over time. Their interaction is often nonlinear, which makes short-term forecasting difficult, even when rich financial information is available (Gunay et al., 2025; Kumar Sharma et al., 2024; Zainudin et al., 2017).
Earlier empirical studies have mainly relied on linear time series models to study these relationships. Vector Autoregression (VAR) and its time-varying extension, the Time-Varying Parameter VAR (TVP–VAR), are among the most commonly used approaches (Ahmed et al., 2023; Li & Yuan, 2024). In the broader literature, TVP–VAR allows coefficients to adjust gradually and includes stochastic volatility, making it more flexible than a standard VAR when market conditions shift (Jiménez et al., 2023; Naifar, 2025). Even so, the underlying structure of the model remains linear. This limits its ability to reflect nonlinear behavior, threshold effects, and regime-specific patterns that frequently appear in real estate markets and in financial systems more broadly, especially during periods of stress or rapid change (Luo et al., 2024; Novotny & Hajek, 2025). In this study, the econometric benchmark is therefore implemented as a TVP–VAR proxy: a recursively re-estimated, expanding-window VAR chosen for empirical tractability and computational feasibility under our strictly time-ordered forecasting design. The proxy is intended to approximate gradual coefficient adaptation through sequential updating, not to estimate a full state-space TVP–VAR with stochastic volatility.
Research on forecasting has not been limited to econometric models alone. In recent years, deep learning approaches have also begun to attract attention, especially Long Short-Term Memory (LSTM) neural networks. These models are often chosen because they can retain information over time and adapt to changing patterns, something that linear time series models struggle to do when relationships become unstable or nonlinear (Park & Yang, 2022; Suprihadi & Danila, 2024; Wei et al., 2025). Evidence from several empirical studies suggests that LSTMs perform well in forecasting tasks across equities, commodities, and foreign exchange markets, particularly during periods marked by repeated shocks or structural instability (Khan et al., 2022; Song et al., 2024).
Even so, their use in REIT forecasting remains relatively rare. One reason is not performance, but trust. Neural networks are frequently criticized for being difficult to interpret, which limits their acceptance in financial applications where regulatory oversight, accountability, and risk control play an important role (Chou et al., 2025; Rad et al., 2023).
In parallel with LSTM-based forecasting, the recent literature has advanced in two complementary directions. First, deep learning forecasting for financial time series has expanded beyond recurrent architectures toward attention-based and convolutional sequence models that are often more scalable and better at capturing long-range dependencies. Transformer-style models and their time series adaptations have become increasingly common in return and volatility prediction, motivated by their ability to learn dynamic cross-feature interactions without relying exclusively on gated recurrence. Related streams include temporal convolutional networks and hybrid attention-recurrent designs that stabilize training and improve responsiveness under abrupt market shifts. At the same time, representation-learning approaches that combine multi-horizon supervision, regime-aware conditioning, or feature augmentation have been used to improve robustness when the predictive signal is weak and nonstationarity is pronounced, which is typical for asset returns. This evolution suggests that the contemporary deep-learning benchmark set in financial forecasting is no longer confined to LSTM variants, but increasingly includes attention-based and hybrid architectures designed for regime sensitivity, nonlinear spillovers, and time-varying dependence.
Second, modern forecasting comparisons in finance frequently combine deep learning with advanced econometric and machine learning alternatives that explicitly target stylized facts such as conditional heteroskedasticity, structural instability, and heavy tails. On the econometric side, volatility-centered models in the GARCH family, mixed-frequency specifications, and state-space time-varying parameter models remain influential because they provide structured interpretations and well-understood inference under evolving uncertainty. On the machine learning side, tree-based ensembles and kernel methods continue to be widely used due to their strong performance under nonlinearities and their practicality with engineered predictors, and they are often paired with explainability tools for feature attribution. Taken together, these developments indicate that conclusions about the relative merits of TVP–VAR proxy and LSTM are most persuasive when positioned within this broader methodological landscape, while remaining attentive to the interpretability requirements that motivate explainable forecasting in financial applications.
This concern has motivated growing interest in explainable artificial intelligence. SHapley Additive exPlanations (SHAP) offers a way to open the black box by attributing model predictions to individual input features, using principles drawn from cooperative game theory (C. Gong et al., 2024; Tan et al., 2023). In finance, SHAP has already been used in areas such as credit assessment, portfolio risk analysis, and factor-based modeling (Cil & Yildiz, 2025; Wen et al., 2022). However, its application to REIT return forecasting has received little attention so far.
Taken together, the existing literature leaves several questions unresolved. Comparisons between TVP–VAR proxy and LSTM models in the context of REIT forecasting are still limited, and studies that place both approaches within a single, harmonized experimental setting are even rarer. In particular, differences in data treatment, predictor selection, and evaluation design make it difficult to draw clear conclusions across studies. Moreover, TVP–VAR proxy evidence in this area is often tied to a single operationalization of time variation, which makes it difficult to disentangle whether performance differences reflect model class advantages or implementation choices. In addition, relatively little attention has been paid to understanding how LSTM-based forecasts can be interpreted, or how the influence of individual predictors may change as market conditions shift. As a result, the structural drivers of REIT returns remain only partially understood from the perspective of explainable deep learning (Chi et al., 2025; Mahmood et al., 2024).
Although the modeling components employed in this study—VAR-based benchmarks, LSTM architectures, and SHAP—are established in the forecasting and machine learning literature, the contribution of the paper is empirical and design-oriented rather than one of methodological novelty. Specifically, the study provides a strictly harmonized and leakage-controlled forecasting protocol for weekly U.S. REIT returns that aligns data preprocessing, predictor construction, and one-step-ahead evaluation across an econometric benchmark and multiple neural network specifications. To ensure that conclusions are not an artifact of a single benchmark operationalization or a single sample split, the analysis is complemented with a structured robustness battery that extends the econometric comparison within the VAR family, evaluates alternative test start windows, replicates results under an alternative REIT proxy, and assesses sensitivity to random initialization in LSTM training. Finally, the study integrates explainability with an explicitly time-varying perspective by using SHAP to connect predictive drivers to economically interpretable channels and to document how feature contributions evolve across volatility conditions. In this sense, the paper aims to clarify what established forecasting and interpretability tools can and cannot deliver in the particularly low-signal setting of weekly return prediction, while maintaining full transparency and reproducibility.
This study is designed to respond to these limitations. It evaluates TVP–VAR proxy and LSTM models using weekly U.S. REIT returns under a strictly harmonized preprocessing and out-of-sample forecasting design, and it treats the rolling expanding window procedure as an empirically tractable proxy for time variation rather than a definitive TVP–VAR implementation. To avoid conclusions that hinge on a single approximation, the econometric comparison is complemented with additional closely related VAR-based benchmarks estimated under alternative recursive or rolling schemes, all using the same predictor set and evaluation protocol. The analysis also incorporates SHAP to examine how different predictors contribute to the forecasts, both on average and over time. By combining systematic model comparison with interpretability, the study puts forward a forecasting framework that aims to balance predictive performance with transparency, and the conclusions are stated relative to the benchmark set considered rather than as a universal claim of deep learning superiority.
To situate the core design choice, the emphasis is placed on a deliberate comparison between two complementary modeling archetypes under a unified pipeline rather than on enumerating the latest architectures. The TVP–VAR proxy is adopted as a transparent, economically grounded baseline that captures evolving linear dependence in a computationally feasible form, while the LSTM is used as a widely established nonlinear sequential benchmark that can absorb memory effects and interaction structure without turning the analysis into an architecture-engineering exercise. Although more recent families such as attention-based Transformers and temporal convolutional networks have attracted growing interest, their inclusion would expand the scope toward model development and extensive tuning, which would make it harder to attribute performance differences to the intended design contrast. The resulting framing therefore keeps the empirical focus on isolating the incremental value of nonlinear sequential modeling relative to a time-varying econometric proxy, and then interpreting any gains through a consistent, verifiable attribution layer.
2. Materials and Methods
This study develops a harmonized experimental framework to evaluate and explain the forecasting performance of Time-Varying Parameter VAR proxy (TVP–VAR proxy) and Long Short-Term Memory (LSTM) neural networks for weekly REIT return prediction. The methodological pipeline consists of five stages: (i) data preparation and transformation, (ii) model construction, (iii) out-of-sample forecasting, (iv) statistical evaluation, and (v) SHAP-based explainability. All procedures are implemented in Python 3.12 under identical preprocessing, feature sets, and evaluation windows to ensure a fair comparison between the TVP–VAR proxy benchmark and the LSTM specifications.
2.1. Research Framework
The research method is organized into several stages that guide the forecasting exercise from start to finish. It begins with assembling weekly VNQ price data together with a set of macro-financial variables that reflect conditions in equity markets, commodities, interest rates, credit, and market volatility. Once collected, the series are converted into log-returns or yield changes where appropriate, then aligned by date and checked for missing values. The final dataset is split chronologically into training, validation, and test samples to preserve the time structure of the data.
Model development proceeds through two parallel streams. The first employs a TVP–VAR proxy estimated using a rolling expanding window, allowing coefficients to adapt gradually as new information becomes available. The second constructs three LSTM architectures of increasing complexity—LSTM_A, LSTM_Base, and LSTM_B—each trained under identical data conditions to ensure comparability. All models generate one-step-ahead forecasts for the test period.
Performance evaluation uses RMSE, MAE, directional accuracy, and residual diagnostics to capture multiple dimensions of forecasting behavior. Statistical significance is assessed using the Diebold–Mariano test for squared error loss differences and the binomial sign test for directional accuracy. To address interpretability, SHAP values are computed for the LSTM_Base model to quantify global feature importance, nonlinear marginal effects, and time-varying contributions across market regimes.
The complete workflow is summarized in Figure 1, which illustrates the sequential steps for constructing, evaluating, and interpreting the TVP–VAR proxy and LSTM models. This structured design ensures methodological alignment, reproducibility, and a transparent connection between empirical results and economic interpretation.
2.2. Data and Variables
2.2.1. Data Sources
To reflect the broader macro-financial environment, the feature set includes variables spanning equity, rates, volatility, credit, commodities, and international spillovers. U.S. equity market conditions are proxied by the S&P 500 ETF (SPY), which captures broad risk-on/risk-off cycles that often co-move with listed real estate through common pricing factors. Interest rate exposure is represented by the change in the 10-year U.S. Treasury yield (TNX) and by the iShares 7–10 Year Treasury Bond ETF (IEF), reflecting the sensitivity of REIT valuations to discount rate dynamics and bond market conditions. Market uncertainty and time-varying risk sentiment are measured using the CBOE Volatility Index (VIX), which is commonly linked to de-risking episodes that affect real estate equities. Credit and liquidity conditions are proxied by the High-Yield Corporate Bond ETF (HYG), capturing fluctuations in funding risk and risk premia that can transmit to REIT returns (Ozcelebi & Yoon, 2025).
In addition, commodity channels are incorporated through crude oil futures (CL=F) and gold futures (GC=F). Oil serves as a broad macro and inflation-linked input that can influence discount rates and sectoral cash flow expectations, while gold is included as a safe haven proxy that reflects shifts in global risk aversion. Finally, to account for international equity spillovers and global co-movement, we include major regional equity indices: the Hang Seng Index (^HSI), the FTSE 100 (^FTSE), and the Nikkei 225 (^N225). These indices capture global market linkages and cross-regional information transmission that may affect U.S. REIT returns via international portfolio rebalancing, correlated risk premia, and synchronized sentiment. Together, the selected variables provide a parsimonious but comprehensive representation of the macro-financial channels that can shape short-horizon REIT return dynamics. All variable tickers and transformations used in the empirical analysis are listed explicitly here to match the feature indices reported in Table 4.
All series are collected at a weekly frequency over the full sample period 6 January 2015 to 30 December 2024 and are obtained from a single public market-data provider to ensure consistent timestamping and avoid cross-source timing mismatches. The target variable is the weekly return on the U.S. REIT proxy (VNQ), while all predictors are constructed from the corresponding weekly observations of the instruments listed above. Prior to modeling, the raw price or index levels are transformed into stationary inputs—weekly log returns for equity, commodity, and international index series; weekly changes for yield-based measures such as TNX; and level or return transformations for bond ETFs consistent with the forecasting specification—after which all variables are aligned on a common calendar via an inner join so that each observation corresponds to the same market week across the full feature set. This explicit dating and alignment ensures that the subsequent train–validation–test splits are purely chronological and that out-of-sample evaluation is conducted on a clearly defined test window reported in the next subsection.
The predictors are chosen to represent well-established macro-financial channels that theory and prior empirical work associate with REIT pricing and short-horizon return variation. Broad equity conditions (SPY) proxy systematic risk premia and risk-on/risk-off cycles that often co-move with listed real estate through common pricing factors. Interest rate and bond market variables (TNX and IEF) capture the discount rate and term-structure channel that directly affects REIT valuation via the present value of cash flows and financing conditions. Market uncertainty (VIX) reflects time-varying risk aversion and volatility-driven repricing that can trigger deleveraging and return comovement. Credit conditions (HYG) proxy fluctuations in funding liquidity and credit risk premia that transmit to REIT performance through refinancing risk and shifts in broader risk appetite. Commodity prices (CL=F and GC=F) are included to represent inflation-sensitive and safe haven dynamics—oil as a macro and inflation-linked input with implications for rates and expected cash flows, and gold as a flight-to-quality indicator of global risk sentiment. Finally, major international equity indices (^HSI, ^FTSE, ^N225) capture global spillovers and cross-regional information transmission that may affect U.S. REIT returns through correlated sentiment and international portfolio rebalancing.
2.2.2. Data Transformation
To ensure statistical consistency and facilitate meaningful comparison across series, all variables are transformed according to their economic characteristics. Price-based variables such as VNQ, SPY, VIX, and HYG are converted into log-returns. The transformation follows the standard formulation:

$$r_t = \ln(P_t) - \ln(P_{t-1}), \qquad (1)$$

where $P_t$ denotes the closing level of the series in week $t$.
This approach stabilizes variance and induces approximate stationarity, properties that are desirable for time series forecasting models (Hewamalage et al., 2021).
Interest rate variables do not follow a multiplicative structure and therefore require a different transformation. The 10-year Treasury yield is expressed in terms of absolute yield changes, defined as:

$$\Delta y_t = y_t - y_{t-1}, \qquad (2)$$

where $y_t$ is the 10-year Treasury yield observed in week $t$.
After all transformations, the time series are merged using an inner join on the date index to ensure that each observation reflects the same market week across all variables. Any missing or misaligned entries generated during this process are removed through list-wise deletion, resulting in a clean and synchronized dataset for modeling.
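For concreteness, the following sketch illustrates one way to implement the transformation and alignment rules described above, assuming `levels` is a weekly DataFrame of closing levels with one column per instrument; the column names, helper functions, and the yield-series set are illustrative rather than the paper's exact implementation.

```python
# A minimal sketch of the transformation and alignment step, assuming the
# weekly levels have already been assembled into a pandas DataFrame.
import numpy as np
import pandas as pd

YIELD_SERIES = {"TNX"}  # yield-based series use absolute changes

def transform(levels: pd.DataFrame) -> pd.DataFrame:
    out = {}
    for col in levels.columns:
        if col in YIELD_SERIES:
            out[col] = levels[col].diff()          # Delta y_t = y_t - y_{t-1}
        else:
            out[col] = np.log(levels[col]).diff()  # r_t = ln(P_t) - ln(P_{t-1})
    # Combining the Series aligns them on the shared date index; list-wise
    # deletion of rows with any missing entry mimics the inner-join step.
    return pd.DataFrame(out).dropna(how="any")
```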
2.2.3. Dataset Partitioning
After the data have been aligned and transformed, the sample is split in chronological order so that the time structure of the series is preserved and any form of look-ahead bias is avoided. The first 70 percent of the observations is used for model training. The next 15 percent serves as a validation set and is used to guide hyperparameter tuning and early stopping decisions. The remaining 15 percent of the data is kept separate and used only for out-of-sample evaluation.
For the neural network models, additional care is taken to avoid information leakage. All predictor variables used by the LSTM are standardized with a StandardScaler that is fitted only on the training sample. The same scaling parameters are then applied to the validation and test sets, ensuring that no information from future observations enters the training process.
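A compact sketch of the chronological split and train-only scaling follows, with `features` denoting the aligned predictor panel from the previous step (variable names are illustrative):

```python
# Chronological 70/15/15 split; the scaler sees only the training block,
# so no future information leaks into preprocessing.
from sklearn.preprocessing import StandardScaler

def chronological_split(panel, train_frac=0.70, val_frac=0.15):
    n = len(panel)
    i_train = int(n * train_frac)
    i_val = int(n * (train_frac + val_frac))
    return panel.iloc[:i_train], panel.iloc[i_train:i_val], panel.iloc[i_val:]

train, val, test = chronological_split(features)
scaler = StandardScaler().fit(train)              # fitted on training only
train_s, val_s, test_s = map(scaler.transform, (train, val, test))
```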
To make the temporal design fully transparent, we additionally report the exact calendar boundaries implied by this chronological split. Under the main experimental configuration, the out-of-sample test window spans 26 July 2023 to 30 December 2024, while all observations prior to the test start date are used for training and validation under the 70/15/15 allocation. Because the LSTM forecasts are generated using a fixed lookback length, the first available prediction date in each evaluation window can occur after the nominal split point; in such cases, the effective test window begins at the first date for which the required lagged inputs are available, but the split itself remains strictly time ordered. For robustness and to ensure that conclusions do not depend on a single cutoff, alternative test start dates are also considered and summarized in Appendix A Table A2, which reports the corresponding evaluation windows under the same preprocessing, predictor set, and walk-forward protocol.
2.3. TVP–VAR Proxy Model
2.3.1. Model Specification
The TVP–VAR proxy is motivated by the broader TVP–VAR literature, in which the standard VAR framework is extended by allowing coefficients to evolve gradually over time to capture structural change and evolving market relationships (J. Wu & Wang, 2025). Let $y_t$ denote a vector of endogenous variables. The TVP–VAR of order $p$ can be written as:

$$y_t = c_t + \sum_{i=1}^{p} A_{i,t}\, y_{t-i} + \varepsilon_t, \qquad (3)$$

where $c_t$ is a time-varying intercept, $A_{i,t}$ represents the time-varying coefficient matrices, and $\varepsilon_t$ is a vector of innovations assumed to follow a mean-zero process with time-dependent covariance structure. The flexibility of $A_{i,t}$ allows the model to adapt to evolving macro-financial dynamics that static VAR models may fail to capture, especially in periods of heightened volatility or structural transitions. This formulation provides the conceptual basis for time variation, while the empirical benchmark employed in this study operationalizes time variation through a proxy estimation scheme described below.
2.3.2. Estimation Procedure
A fully Bayesian implementation of TVP–VAR typically relies on Markov Chain Monte Carlo (MCMC) methods and is computationally demanding, particularly when applied to high-frequency or multivariate financial data (Ge & Zhang, 2022). To preserve the spirit of coefficient adaptability while ensuring computational feasibility, this study adopts a rolling expanding window estimation strategy as an empirically tractable proxy for time variation rather than a definitive TVP–VAR implementation (X. Gong et al., 2023).
The process begins with selecting the optimal lag order using the Akaike Information Criterion, which balances model fit and parsimony. For each forecast iteration, the VAR(p) model is re-estimated using all available observations up to that point, allowing coefficients to update as new data arrive. After re-estimation, the model produces a one-step-ahead forecast for the REIT return series. This recursive approach approximates adaptive coefficient behavior and can capture gradual shifts in dependence patterns, while remaining substantially more efficient in terms of computation and model complexity (Będowska-Sójka et al., 2024).
To ensure that inference does not hinge on a single operationalization of time variation, the econometric benchmark set is augmented with additional closely related VAR-based baselines estimated under alternative updating schemes. In particular, alongside the rolling expanding proxy, we report results for (i) a fixed-length rolling window VAR and (ii) a recursive VAR benchmark, each evaluated under the same predictor set, lag-selection rule, and strictly time-ordered out-of-sample protocol used for the LSTM models. This design broadens the econometric comparison and supports conclusions that are conditional on a transparent benchmark set rather than on a single approximation choice.
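The expanding-window proxy can be sketched as follows, using the Statsmodels VAR implementation referenced in Section 2.8; the function and variable names, the `max_lags` bound, and the assumption that the VNQ return occupies the first column of `data` are illustrative. A fixed-length rolling variant replaces `data.iloc[:t]` with `data.iloc[t - W:t]`.

```python
# Expanding-window VAR proxy: re-estimate at every step with AIC lag
# selection, then produce a one-step-ahead forecast for the VNQ return.
import numpy as np
from statsmodels.tsa.api import VAR

def expanding_var_forecasts(data, test_start, max_lags=8):
    preds = []
    for t in range(test_start, len(data)):
        window = data.iloc[:t]                   # information set up to t-1
        res = VAR(window).fit(maxlags=max_lags, ic="aic")
        fc = res.forecast(window.values[-res.k_ar:], steps=1)
        preds.append(fc[0, 0])                   # VNQ equation, one step ahead
    return np.asarray(preds)
```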
2.3.3. State-Space TVP–VAR (Kalman Filter) Robustness Estimation
To further address concerns that empirical conclusions may depend on a single approximation of time variation, we additionally implement a state-space TVP–VAR estimated via the Kalman filter, which is the standard computational approach in the time-varying parameter VAR literature. In this formulation, the observation equation corresponds to the VAR representation in (3), while the time-varying coefficients follow a stochastic evolution (state) equation that allows parameters to drift smoothly over time. Specifically, stacking the coefficient matrices into a state vector $\beta_t$, the model can be written in state-space form as:

$$y_t = X_t \beta_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, R), \qquad (4)$$
$$\beta_t = \beta_{t-1} + \eta_t, \qquad \eta_t \sim N(0, Q), \qquad (5)$$

where $X_t = I_n \otimes (1, y_{t-1}', \ldots, y_{t-p}')$, with $I_n$ denoting the $n \times n$ identity matrix and $\otimes$ the Kronecker product. The disturbance terms $\varepsilon_t$ and $\eta_t$ are assumed mutually uncorrelated at all leads and lags. Under (4) and (5), the Kalman filter yields recursive updates of $\beta_t$ as new observations arrive, producing one-step-ahead forecasts that reflect continuously evolving coefficients under a probabilistic updating rule. In implementation, the lag order $p$ is selected using the Akaike Information Criterion, consistent with the main specification. The covariance components $R$ and $Q$ are estimated by maximizing the Gaussian log-likelihood implied by the Kalman filter prediction error decomposition.
Forecasts from the Kalman-filter TVP–VAR are evaluated under the same strictly time-ordered out-of-sample protocol, predictor set, and preprocessing steps used for the rolling-expanding proxy and the LSTM models. This robustness benchmark directly addresses the concern that results might be driven by the expanding window approximation rather than by the underlying econometric model class. To allow time variation while maintaining recursive estimation, the coefficients follow a random walk state evolution, implemented through a fixed discount (forgetting factor) formulation that controls the effective state noise magnitude and produces smooth yet adaptive coefficient paths without ad hoc retuning.
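As a minimal illustration of the recursion, the sketch below filters a single VAR equation with random-walk coefficients under the discount formulation; the full implementation stacks all equations as in (4) and (5), and the discount value and noise scale shown here are illustrative assumptions.

```python
# Kalman filter for one equation y_t = x_t' beta_t + eps_t with random-walk
# states; the forgetting factor `delta` inflates the predicted state
# covariance, playing the role of the state noise Q.
import numpy as np

def kalman_tvp_forecasts(X, y, delta=0.99, r=1.0):
    T, k = X.shape
    beta = np.zeros(k)                 # filtered coefficient vector
    P = np.eye(k)                      # state covariance
    preds = np.empty(T)
    for t in range(T):
        x = X[t]
        P_pred = P / delta             # discounted (inflated) covariance
        preds[t] = x @ beta            # one-step-ahead forecast
        v = y[t] - preds[t]            # prediction error
        S = x @ P_pred @ x + r         # prediction-error variance
        K = P_pred @ x / S             # Kalman gain
        beta = beta + K * v            # coefficient update
        P = P_pred - np.outer(K, x @ P_pred)
    return preds
```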
2.4. LSTM Neural Network Models
2.4.1. Network Architecture
Long Short-Term Memory (LSTM) neural networks are specifically designed to capture nonlinear temporal dependencies by using gated memory mechanisms that regulate information flow across time steps (Chi et al., 2025). This structure enables LSTMs to learn complex, state-dependent dynamics that traditional linear time series models cannot represent, making them particularly suitable for financial forecasting applications characterized by volatility clustering, nonlinear shocks, and regime-dependent behavior (Wei et al., 2025).
To explore the role of network depth in forecasting performance, the study implements three LSTM models with gradually increasing complexity. The simplest specification, LSTM_A, uses a single hidden layer with 32 units and relies on an input window of eight observations. A more flexible baseline model, LSTM_Base, is constructed with two stacked LSTM layers containing 64 and 32 units and a longer window of twelve periods. The most complex variant, LSTM_B, extends this structure further by adding a third hidden layer, resulting in 128, 64, and 32 units, and uses a window size of twenty observations.
In all cases, the LSTM layers are followed by a dense output layer with linear activation to produce one-step-ahead return forecasts (Park & Yang, 2022). By varying both the depth of the network and the length of the input window, this design allows the analysis to examine how additional representational capacity influences out-of-sample predictive accuracy.
2.4.2. Training Procedure
All LSTM models are trained under a unified training protocol to ensure comparability across architectures. Model parameters are optimized by minimizing the Mean Squared Error (MSE), which provides a smooth and convex objective for gradient-based learning. The Adam optimizer is employed with a learning rate of either 0.001 or 0.0005, reflecting common practice in training recurrent neural networks for financial time series prediction. Each model is trained using a batch size of 32, and training convergence is stabilized through early stopping with a patience threshold of twenty epochs. Additionally, a learning-rate reduction on plateau is applied after ten epochs without improvement, allowing the optimizer to refine local minima during later training stages.
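A sketch of the LSTM_Base specification and the training protocol just described, using the TensorFlow/Keras framework named in Section 2.8; the feature count and epoch budget are illustrative assumptions, and the commented `fit` call shows how the pieces connect.

```python
import tensorflow as tf

def build_lstm_base(window=12, n_features=11):
    # Two stacked LSTM layers (64 and 32 units) and a linear output head,
    # matching the LSTM_Base description above.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(patience=10, factor=0.5),
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, batch_size=32, callbacks=callbacks)
```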
To transform the REIT return series and its predictors into a supervised learning problem, the dataset is structured using sliding windows. For a window size $w$, the input sequence and target variable are constructed as:

$$X_t = (r_{t-w+1}, \ldots, r_t), \qquad y_t = r_{t+1},$$

where $X_t$ contains the trailing sequence of returns and $y_t$ represents the next-period return to be predicted. This formulation aligns the LSTM with the one-step-ahead forecasting objective used throughout the study.
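In code, the window construction amounts to the following sketch, where the target is assumed to be the first column of the scaled array:

```python
import numpy as np

def make_windows(series: np.ndarray, w: int):
    # series: (T, n_features) scaled observations. X_t holds the trailing w
    # rows and y_t is the next-period VNQ return (column 0), so each pair is
    # strictly one-step-ahead.
    X, y = [], []
    for t in range(w, len(series)):
        X.append(series[t - w:t])
        y.append(series[t, 0])
    return np.asarray(X), np.asarray(y)
```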
Although all three LSTM configurations are evaluated for forecasting performance, only the baseline model is subjected to SHAP explainability analysis, as it achieves the best trade-off between predictive accuracy and architectural parsimony (Qian et al., 2025).
2.5. Forecasting and Model Evaluation
2.5.1. One-Step-Ahead Forecasting
Both the TVP–VAR proxy and LSTM models are evaluated under a strictly constrained one-step-ahead forecasting design, which ensures that predictions rely solely on information available at the corresponding time point, thereby preventing any form of look-ahead bias. The LSTM models generate forecasts using fixed parameters learned exclusively from the training and validation sets, reflecting a standard out-of-sample forecasting protocol for neural networks (Q. Wu et al., 2023). In contrast, the TVP–VAR proxy model updates its parameters recursively; at each forecasting step, the VAR(p) system is re-estimated using all observations available up to that date, allowing the framework to adapt gradually to evolving data-generating processes in a manner consistent with time-varying parameter modeling (J. Wu & Wang, 2025). This design ensures a fair and methodologically aligned comparison between a static-weight deep learning model and an adaptive econometric benchmark.
2.5.2. Accuracy Metrics
Model evaluation relies on three complementary performance measures—Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Directional Accuracy (DA)—each capturing a distinct dimension of forecasting behavior. RMSE penalizes large errors more heavily and is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)^2}.$$

MAE provides a more robust measure of average deviation, expressed as:

$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|y_t - \hat{y}_t\right|.$$

Directional Accuracy evaluates whether the model correctly predicts the sign of the return, a metric particularly relevant for trading and risk management applications:

$$\mathrm{DA} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\left\{\operatorname{sign}(\hat{y}_t) = \operatorname{sign}(y_t)\right\}.$$
By combining magnitude-based and direction-based metrics, the evaluation framework captures both statistical precision and decision-oriented performance, which is essential in financial forecasting contexts.
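The three measures reduce to a few lines of NumPy, as in the following sketch:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def directional_accuracy(y, yhat):
    # Share of weeks in which the predicted and realized signs agree.
    return float(np.mean(np.sign(y) == np.sign(yhat)))
```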
2.5.3. Proposed Forecasting Architecture
To ensure methodological consistency and enable transparent comparison across modeling approaches, this study proposes a unified forecasting architecture that integrates data preprocessing, window-based feature construction, model-specific forecasting logic, and unified evaluation procedures. The workflow is designed to preserve the temporal structure of the data while maintaining comparability between the adaptive TVP–VAR proxy and static-parameter LSTM models.
The following diagram (Figure 2) summarizes the end-to-end architecture:
The proposed architecture begins with collecting raw market and macro-financial data and proceeds through cleaning, alignment, and transformation stages that convert all series into compatible return or yield-change formats. Sliding windows are then constructed to produce supervised learning sequences suitable for LSTM training while simultaneously generating the lagged structure required for VAR-based modeling.
The architecture then diverges into two parallel forecasting paths. The LSTM path uses fixed weights learned during training and validation, reflecting the standard practice in deep learning where models do not update parameters during testing. Conversely, the TVP–VAR proxy path re-estimates the VAR(p) model at every forecast step, allowing its coefficients to evolve with the data, consistent with the principles of time-varying parameter modeling (Jiménez et al., 2023; J. Wu & Wang, 2025).
Both paths ultimately generate one-step-ahead predictions, which feed into a unified evaluation module that computes RMSE, MAE, and Directional Accuracy, as well as the Diebold–Mariano test for statistical comparison. This unified architecture ensures that differences in forecasting performance can be attributed to model structure rather than discrepancies in data preparation or evaluation methodology.
2.5.4. Hyperparameter Selection and Data-Snooping Controls
To mitigate data-snooping concerns and ensure that model design choices are not informed by test set outcomes, all LSTM hyperparameters are selected using only the training and validation samples, with the test set reserved exclusively for final out-of-sample evaluation. Candidate LSTM configurations are prespecified and compared under the same one-step-ahead forecasting protocol used in the main experiments, and selection is based on minimizing validation RMSE. The search space includes architectural depth (one versus two LSTM layers), hidden dimension size, dropout regularization, learning rate, batch size, and lookback length used to construct supervised sequences. Model training employs early stopping monitored on the validation loss to reduce overfitting risk, and the final LSTM_Base specification corresponds to the configuration with the best validation performance. Importantly, no hyperparameter tuning or informal adjustment is performed using test results; once the LSTM_Base configuration is selected, it is held fixed for all reported comparisons and robustness checks to ensure that test set metrics reflect genuine out-of-sample performance rather than iterative tuning.
2.6. Statistical Significance Testing
2.6.1. Diebold–Mariano Test
Statistical comparison of competing forecasting models requires an assessment of whether observed differences in predictive accuracy are statistically meaningful rather than the result of random variation. To this end, the Diebold–Mariano (DM) test is employed to evaluate whether two models exhibit equal expected forecast loss over the evaluation horizon (Jiang et al., 2022). For two competing forecasts with errors $e_{1,t}$ and $e_{2,t}$, the DM loss differential series is defined as:

$$d_t = L(e_{1,t}) - L(e_{2,t}),$$

where the squared error loss function is specified as:

$$L(e_{i,t}) = e_{i,t}^2.$$

Under the null hypothesis of equal predictive accuracy, the expected value of $d_t$ is zero. The DM statistic adjusts for serial correlation in the loss differential—an important consideration in time series forecasting—and provides a robust framework for determining whether one model significantly outperforms another in terms of squared predictive error.
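For reference, a minimal sketch of the DM statistic under squared error loss with a Bartlett-weighted HAC variance follows; for the one-step-ahead design used here, $h = 1$ and the correction reduces to the sample variance of $d_t$.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2, h=1):
    # Loss differential d_t under squared error loss.
    d = e1 ** 2 - e2 ** 2
    T, d_bar = len(d), d.mean()
    # HAC variance of the mean with Bartlett weights, truncation lag h - 1.
    var = np.mean((d - d_bar) ** 2)
    for k in range(1, h):
        cov = np.mean((d[k:] - d_bar) * (d[:-k] - d_bar))
        var += 2 * (1 - k / h) * cov
    dm = d_bar / np.sqrt(var / T)
    pval = 2 * (1 - stats.norm.cdf(abs(dm)))   # two-sided p-value
    return dm, pval
```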
2.6.2. Binomial Sign Test for Directional Accuracy
While magnitude-based metrics such as RMSE and MAE capture statistical precision, directional accuracy (DA) provides insight into the model’s ability to correctly anticipate the sign of returns, which is often more relevant in trading and allocation contexts (Greer, 2003). To examine whether a model’s directional predictions exceed random guessing, a binomial sign test is applied. The null and alternative hypotheses are specified as:

$$H_0: p = 0.5 \qquad \text{versus} \qquad H_1: p > 0.5,$$

where $p$ denotes the probability that the predicted sign matches the actual sign of the return. Rejecting the null hypothesis implies that the forecasting model delivers directionally informative signals beyond chance, providing practical relevance for applications in timing, hedging, and risk-sensitive decision-making (Greer, 2003).
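Equivalently, the test can be run with SciPy's exact binomial test, as in this sketch:

```python
import numpy as np
from scipy.stats import binomtest

def directional_sign_test(y, yhat):
    hits = int(np.sum(np.sign(y) == np.sign(yhat)))
    # One-sided test of H0: p = 0.5 against H1: p > 0.5.
    return binomtest(hits, n=len(y), p=0.5, alternative="greater").pvalue
```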
2.7. SHAP Explainability Analysis
2.7.1. SHAP Framework
To address the interpretability limitations commonly associated with deep learning models, this study employs SHapley Additive exPlanations (SHAP), a unified framework grounded in cooperative game theory that attributes the output of a model to the marginal contributions of each input feature (Hussain et al., 2021; Tan et al., 2023). For the LSTM model’s one-step-ahead forecast $\hat{y}_t$, the SHAP decomposition expresses the prediction as the sum of a baseline expectation and a set of feature-specific contribution values:

$$\hat{y}_t = \phi_0 + \sum_{i=1}^{M} \phi_{i,t},$$

where $\phi_{i,t}$ denotes the Shapley value representing the contribution of feature $i$ to the prediction at time $t$, and $\phi_0$ represents the average model output over a reference distribution. This additive formulation ensures consistency and local accuracy, enabling SHAP to provide interpretable and theoretically principled explanations for complex nonlinear forecasting models (C. Gong et al., 2024).
2.7.2. Implementation
SHAP analysis focuses on the baseline LSTM model, as this specification provides a reasonable balance between forecasting accuracy and model complexity. To compute Shapley values, a subset of the training data is used as background samples, representing typical market conditions observed by the model during learning. The explainer is selected based on network compatibility, using DeepExplainer when feasible and GradientExplainer otherwise, so that feature contributions can be estimated without modifying the model structure.
Several types of SHAP outputs are examined. These include overall importance rankings to identify dominant predictors, beeswarm plots to show the distribution of feature effects, dependence plots to illustrate nonlinear relationships, and time-based SHAP paths that track how predictor influence changes over time. Together, these results help clarify not only which variables drive LSTM forecasts, but also how their roles vary across different market environments. This interpretability layer complements the forecasting results by linking model predictions to observable economic behavior.
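A sketch of the explainer-selection logic follows, where `model` and the scaled sequence arrays are assumed to come from the earlier training step and the background size of 100 is an illustrative choice:

```python
import numpy as np
import shap

rng = np.random.default_rng(0)
idx = rng.choice(len(X_train_seq), size=100, replace=False)
background = X_train_seq[idx]          # typical training-period conditions

try:
    explainer = shap.DeepExplainer(model, background)
    shap_values = explainer.shap_values(X_test_seq)
except Exception:
    # Fall back when the network graph is incompatible with DeepExplainer.
    explainer = shap.GradientExplainer(model, background)
    shap_values = explainer.shap_values(X_test_seq)

# Aggregating |SHAP| over time steps yields global feature-importance rankings.
```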
2.8. Computational Environment
All computational procedures, including data preprocessing, model estimation, forecasting, and explainability analysis, were conducted using Python 3.12. The LSTM models were implemented with the TensorFlow/Keras framework. VAR-family benchmarks, including the TVP–VAR proxy constructed via recursive expanding window VAR estimation, were implemented using Statsmodels econometric routines. The state-space TVP–VAR robustness benchmark was implemented via a Kalman filter likelihood-based state-space procedure, consistent with the specification in Section 2.3.3. SHAP-based interpretability was performed using the SHAP library, and Streamlit was employed to structure a reproducible experimental pipeline that integrates data loading, model execution, visualization, and result exporting. All figures and tables presented in this manuscript were generated directly from this unified workflow, and the code and associated outputs are made available in the project repository (see the Data Availability statement), without relying on a separate Supplementary Materials file.
2.9. Use of Generative AI (GenAI)
Generative AI tools (specifically ChatGPT version 5.3) were used solely for tasks related to linguistic refinement, such as improving clarity, enhancing academic tone, structuring narrative coherence, and drafting preliminary section outlines based on numerical results provided by the authors. Importantly, GenAI was not used to generate, manipulate, or analyze data, nor to produce any computational results associated with the forecasting experiments, statistical tests, or SHAP analyses. All quantitative work was performed independently by the authors using the computational framework described above, ensuring the scientific integrity and reproducibility of the study.
2.10. Availability of Data, Code, and Materials
All datasets used in this study—including raw market data, processed time series, constructed feature sets, trained model outputs, forecast results, evaluation metrics, and SHAP interpretability artifacts—will be made publicly available upon acceptance of the manuscript. A dedicated online repository will host the full Python codebase encompassing data preparation scripts, model implementations for the TVP–VAR proxy benchmark (and the state-space TVP–VAR (Kalman filter) robustness specification) as well as the LSTM architectures, forecasting routines, statistical significance tests, and visualization modules. No proprietary, confidential, or restricted data sources were used. All materials necessary for full replication of the empirical results will therefore be openly accessible, supporting transparency and facilitating further research in REIT forecasting and explainable financial modeling.
4. Discussion
This section interprets the empirical results by linking forecasting performance, statistical significance, and SHAP-based interpretability to broader economic mechanisms underlying REIT market behavior. The aim is not only to identify which model performs best but to explain why performance differences arise and what they imply for forecasting practice, risk assessment, and market understanding.
4.1. Why LSTM Outperforms the TVP–VAR Proxy
The forecasting results show that the LSTM_Base model performs better than the TVP–VAR proxy across all reported accuracy measures (Table 1). While the reductions in RMSE and MAE are relatively small and only moderately significant (Table 2), the pattern is consistent. This outcome can be traced to differences in how the two approaches represent return dynamics. The TVP–VAR proxy relies on a linear structure with coefficients that evolve smoothly over time. Although this allows for gradual structural change, it restricts the model’s ability to reflect nonlinear interactions, threshold effects, and regime-dependent behavior that frequently characterize financial markets (Canöz & Kalkavan, 2024). In the context of REITs, returns are often affected by volatility clustering and sudden adjustments in interest rates and risk sentiment, which remain difficult to capture within a linear framework (Salisu et al., 2024).
LSTM models approach the problem differently. Through their gated memory structure, they can learn nonlinear and state-dependent relationships directly from the data and adjust as conditions change over time (Masum et al., 2022). This flexibility helps explain their stronger performance around turning points and short-lived market disruptions, as illustrated in Figure 4 and Figure 5. The residual patterns provide further support: LSTM_Base shows a tighter error distribution (Figure 7) and lower rolling MAE during volatile episodes (Figure 8). Comparing the neural architectures, the baseline configuration strikes the most effective balance. The shallow LSTM_A lacks sufficient flexibility, while the deeper LSTM_B exhibits signs of overfitting, making LSTM_Base the most stable and reliable specification.
4.2. Economic Significance of Directional Accuracy
Directional accuracy offers an additional lens that is closely related to trading and short-term allocation decisions. In this study, LSTM_Base records the highest directional accuracy at 0.5318, but the improvement cannot be distinguished statistically from random prediction (Table 3). This result is not unexpected. Weekly REIT returns are characterized by a weak signal relative to noise, which makes predicting the direction of returns more challenging than reducing forecast errors. Improvements in squared error measures therefore do not necessarily imply better sign predictions, particularly when returns fluctuate symmetrically around zero (Guzzetti, 2020). In addition, the limited length of the evaluation sample reduces the power of binomial significance tests.
However, the economic value test in Section 3.3.4 shows that under a transparent long–cash rule with 10 bps costs, net performance is negative, indicating that the observed directional metrics do not translate into exploitable profitability under the trading assumptions examined in this study. While modest directional edges may prove economically relevant under alternative portfolio constructions, position-sizing rules, and risk controls (e.g., volatility scaling or more sophisticated overlays), such gains are not supported by the present backtest evidence and therefore remain strategy-dependent rather than implied by directional accuracy alone.
4.3. SHAP Insights: Global Linkages and Nonlinear Drivers
The SHAP results offer a clearer view of how the LSTM model uses available information. As shown in Table 4 and Figure 9, past VNQ returns contribute the most to the forecasts, suggesting that short-term persistence remains an important feature of REIT pricing. This pattern is consistent with gradual adjustment in REIT markets, where prices do not immediately reflect new information.
Several external variables also play a visible role. The Hang Seng Index (^HSI) and crude oil prices (CL=F) frequently appear among the strongest contributors, indicating that REIT returns respond to broader global conditions rather than domestic factors alone. The dependence plots (Figure 10, Figure 11 and Figure 12) show that these effects are not constant. In particular, the influence of ^HSI becomes more pronounced during periods of stronger performance in Asian equity markets.
Oil price movements also affect the forecasts, which is consistent with their link to inflation expectations and macroeconomic cycles (Luo et al., 2024). Other variables, including the FTSE 100, Nikkei 225, gold, and credit market indicators, contribute more modestly but reinforce the presence of global and cross-asset linkages (Ozcelebi & Yoon, 2025). Interest rate measures such as TNX and IEF remain relevant, although their overall impact is smaller, in line with evidence that short-horizon REIT dynamics are increasingly shaped by international financial conditions (M. C. Wu & Wang, 2024).
The prominence of the Hang Seng Index (^HSI) in the SHAP attribution is economically consistent with the view that Hong Kong acts as a key conduit in global risk transmission between Asian and U.S. markets, especially under time-varying connectedness and volatility spillovers. Recent evidence documents that cross-market spillovers between the U.S., Hong Kong, and Mainland China are dynamic and that Hong Kong can play an intermediary role in transmitting shocks across regions, which supports interpreting ^HSI as an informative proxy for global risk conditions rather than merely a local equity factor. In the same spirit, the recurring importance of crude oil (CL=F) is consistent with the macro-financial transmission channel in which oil price shocks affect inflation expectations and discount rate dynamics, which can propagate to rate-sensitive assets such as securitized real estate. The literature on real estate and securitized real estate markets increasingly emphasizes that oil shocks matter not only through growth expectations but also through inflation and interest rate adjustments, thereby influencing REIT valuation and short-horizon return variation. Together, these channels provide a verifiable interpretation for why ^HSI and CL=F emerge as influential predictors in an LSTM setting: they proxy time-varying global risk and macro-inflation news that can shift common risk premia and discount rates, even when directional predictability remains weak.
4.4. Regime-Dependent Sensitivities
The time-varying SHAP patterns (Figure 12, Figure 13 and Figure 14) show that predictor importance shifts across market regimes. During volatility spikes, the influence of lagged VNQ, SPY, and TNX increases sharply, reflecting heightened sensitivity of REIT pricing to domestic equity sentiment and interest rate movements when market stress intensifies (Tadle, 2022). In calmer periods, global equity indices and commodity variables become more influential, suggesting stronger cross-market transmission when uncertainty is lower.
VIX exhibits relatively stable SHAP magnitudes across regimes, indicating that while volatility remains relevant, it does not dominate REIT dynamics compared with other macro-financial drivers. These regime-dependent patterns highlight the value of nonlinear models combined with SHAP interpretability for detecting evolving dependencies that static or slowly drifting linear models may obscure.
4.5. Implications for Forecasting, Portfolio Management, and Market Understanding
These results have several implications for forecasting and applied financial analysis. First, the LSTM specifications exhibit more stable error-based performance than the econometric benchmark set under a harmonized and strictly out-of-sample design, suggesting that nonlinear sequence models can offer incremental robustness when REIT–macro relationships are unstable. Second, the SHAP-based analysis improves practical usability by making it clearer how macro-financial variables contribute to forecast movements and how these contributions shift across market conditions, which is particularly relevant for risk management settings where transparency and accountability are central. From a portfolio perspective, the prominence of global equity indices and commodity-related variables implies that REIT exposures are better interpreted within a broader, internationally connected asset framework rather than only against domestic equity benchmarks. Overall, the interpretability layer complements the forecasting exercise by linking predictive patterns to economically plausible channels and by providing a structured way to communicate model behavior in applied decision contexts.
4.6. Limitations and Practical Scope
The modest visual tracking of realized weekly returns in the forecast plots should be interpreted in the context of the low signal-to-noise ratio that characterizes short-horizon return series. In such settings, even well-specified models often struggle to reproduce high-frequency fluctuations in the realized path, and performance differences are more appropriately assessed using strictly time-ordered out-of-sample metrics and robustness evidence rather than visual alignment alone. Consistent with this perspective, the results indicate that LSTM_Base yields small but persistent improvements in RMSE and MAE relative to the VAR-based benchmark set, while directional accuracy remains broadly comparable and is not consistently distinguishable from random prediction. Moreover, when forecasts are translated into a transparent long–cash allocation rule under realistic transaction costs, net-of-cost performance is weak, highlighting that incremental statistical gains do not automatically imply economically exploitable profitability in this sample. Accordingly, the practical scope of the present findings is best framed as incremental error reduction and improved interpretability of dependence channels under time-varying conditions, rather than as evidence of reliable trading profitability.
5. Conclusions
This paper evaluates whether LSTM neural networks can improve the forecasting of weekly U.S. REIT returns relative to a TVP–VAR proxy benchmark, under a fully harmonized experimental design that holds constant data preprocessing, feature construction, and the strictly time-ordered out-of-sample evaluation protocol. Within this controlled setting, the baseline LSTM configuration produces the most stable forecasts. The gains over the econometric benchmark are modest, but they appear consistently in error-based measures such as RMSE and MAE. In contrast, evidence for directional predictability is weaker: directional accuracy differences are small and not consistently distinguishable from benchmark performance at conventional significance levels, so directional results are interpreted as complementary diagnostics rather than as the primary basis for model superiority.
Beyond forecasting accuracy, the study emphasizes interpretability as a practical requirement in financial applications. The SHAP analysis provides a transparent decomposition of the LSTM forecasts and links the model’s behavior to economically meaningful channels. Recent REIT returns emerge as the dominant driver, consistent with short-horizon persistence and partial adjustment in listed real estate pricing. Global equity indicators—most notably the Hang Seng Index—also rank highly, together with crude oil, highlighting the role of cross-market spillovers and macro-commodity sensitivity in shaping U.S. REIT returns. Importantly, the influence of these predictors is not constant over time: their contributions vary across volatility regimes, consistent with changing dependence structures and shifts in risk transmission as market conditions move from calm to turbulent states.
The analysis remains subject to several limitations. The benchmark comparison is necessarily conditional on the model set considered and on the available evaluation window, and the predictive signal in weekly REIT returns is inherently weak. For this reason, the conclusions are stated cautiously and are supported by robustness exercises designed to reduce reliance on specific implementation choices. In particular, additional VAR-family benchmarks mitigate dependence on a single operationalization of time variation, an alternative REIT proxy (IYR) shows that the main error-based conclusions are not proxy-specific, and a seed-based sensitivity analysis indicates that the LSTM results are not driven by favorable random initialization. Future work could extend the framework to sector-level REIT indices, longer samples, richer information sets, and alternative nonlinear architectures, while maintaining the same emphasis on transparent evaluation and explainable forecasting.
In line with the empirical evidence, the contribution of the study is positioned as a transparent and reproducible framework for comparative forecasting and feature attribution analysis in REIT returns. The incremental benefits of nonlinear sequential modeling are best characterized as modest reductions in forecast error that are stable across robustness designs, while directional predictability remains weak; accordingly, results are interpreted relative to the benchmark set considered rather than as a universal claim of deep learning superiority or universally actionable sign-timing signals.