1. Introduction
Real Estate Investment Trusts (REITs) are widely used in investment portfolios because they allow investors to gain exposure to real estate without giving up the liquidity and regulatory protections found in equity markets (Danila, 2025; Glickman, 2014; Gogineni et al., 2024). In practice, however, REIT returns are not easy to anticipate. They respond to many macro-financial conditions at once, including movements in interest rates, changes in global sentiment, shifts in commodity prices, and the way risk spreads across markets (Asadov et al., 2025; Fasanya & Adekoya, 2022; M. C. Wu & Wang, 2024). These influences tend to overlap and change over time. Their interaction is often nonlinear, which makes short-term forecasting difficult, even when rich financial information is available (Gunay et al., 2025; Kumar Sharma et al., 2024; Zainudin et al., 2017).
Earlier empirical studies have mainly relied on linear time series models to study these relationships. Vector Autoregression (VAR) and its time-varying extension, the Time-Varying Parameter VAR (TVP–VAR), are among the most commonly used approaches (Ahmed et al., 2023; Li & Yuan, 2024). In the broader literature, TVP–VAR allows coefficients to adjust gradually and includes stochastic volatility, making it more flexible than a standard VAR when market conditions shift (Jiménez et al., 2023; Naifar, 2025). Even so, the underlying structure of the model remains linear. This limits its ability to reflect nonlinear behavior, threshold effects, and regime-specific patterns that frequently appear in real estate markets and in financial systems more broadly, especially during periods of stress or rapid change (Luo et al., 2024; Novotny & Hajek, 2025). In this study, the econometric benchmark is therefore implemented as a TVP–VAR proxy: a recursively re-estimated, expanding-window VAR chosen for empirical tractability and computational feasibility under our strictly time-ordered forecasting design. The proxy is intended to approximate gradual coefficient adaptation through sequential updating, not to estimate a full state-space TVP–VAR with stochastic volatility.
Research on forecasting has not been limited to econometric models alone. In recent years, deep learning approaches have also begun to attract attention, especially Long Short-Term Memory (LSTM) neural networks. These models are often chosen because they can retain information over time and adapt to changing patterns, something that linear time series models struggle to do when relationships become unstable or nonlinear (Park & Yang, 2022; Suprihadi & Danila, 2024; Wei et al., 2025). Evidence from several empirical studies suggests that LSTMs perform well in forecasting tasks across equities, commodities, and foreign exchange markets, particularly during periods marked by repeated shocks or structural instability (Khan et al., 2022; Song et al., 2024).
Even so, their use in REIT forecasting remains relatively rare. One reason is not performance, but trust. Neural networks are frequently criticized for being difficult to interpret, which limits their acceptance in financial applications where regulatory oversight, accountability, and risk control play an important role (Chou et al., 2025; Rad et al., 2023).
In parallel with LSTM-based forecasting, the recent literature has advanced in two complementary directions. First, deep learning forecasting for financial time series has expanded beyond recurrent architectures toward attention-based and convolutional sequence models that are often more scalable and better at capturing long-range dependencies. Transformer-style models and their time series adaptations have become increasingly common in return and volatility prediction, motivated by their ability to learn dynamic cross-feature interactions without relying exclusively on gated recurrence. Related streams include temporal convolutional networks and hybrid attention-recurrent designs that stabilize training and improve responsiveness under abrupt market shifts. At the same time, representation-learning approaches that combine multi-horizon supervision, regime-aware conditioning, or feature augmentation have been used to improve robustness when the predictive signal is weak and nonstationarity is pronounced, which is typical for asset returns. This evolution suggests that the contemporary deep-learning benchmark set in financial forecasting is no longer confined to LSTM variants, but increasingly includes attention-based and hybrid architectures designed for regime sensitivity, nonlinear spillovers, and time-varying dependence.
Second, modern forecasting comparisons in finance frequently combine deep learning with advanced econometric and machine learning alternatives that explicitly target stylized facts such as conditional heteroskedasticity, structural instability, and heavy tails. On the econometric side, volatility-centered models in the GARCH family, mixed-frequency specifications, and state-space time-varying parameter models remain influential because they provide structured interpretations and well-understood inference under evolving uncertainty. On the machine learning side, tree-based ensembles and kernel methods continue to be widely used due to their strong performance under nonlinearities and their practicality with engineered predictors, and they are often paired with explainability tools for feature attribution. Taken together, these developments indicate that conclusions about the relative merits of TVP–VAR proxy and LSTM are most persuasive when positioned within this broader methodological landscape, while remaining attentive to the interpretability requirements that motivate explainable forecasting in financial applications.
This concern has motivated growing interest in explainable artificial intelligence. SHapley Additive exPlanations (SHAP) offers a way to open the black box by attributing model predictions to individual input features, using principles drawn from cooperative game theory (C. Gong et al., 2024; Tan et al., 2023). In finance, SHAP has already been used in areas such as credit assessment, portfolio risk analysis, and factor-based modeling (Cil & Yildiz, 2025; Wen et al., 2022). However, its application to REIT return forecasting has received little attention so far.
Taken together, the existing literature leaves several questions unresolved. Comparisons between TVP–VAR proxy and LSTM models in the context of REIT forecasting are still limited, and studies that place both approaches within a single, harmonized experimental setting are even rarer. In particular, differences in data treatment, predictor selection, and evaluation design make it difficult to draw clear conclusions across studies. Moreover, TVP–VAR proxy evidence in this area is often tied to a single operationalization of time variation, which makes it difficult to disentangle whether performance differences reflect model class advantages or implementation choices. In addition, relatively little attention has been paid to understanding how LSTM-based forecasts can be interpreted, or how the influence of individual predictors may change as market conditions shift. As a result, the structural drivers of REIT returns remain only partially understood from the perspective of explainable deep learning (Chi et al., 2025; Mahmood et al., 2024).
Although the modeling components employed in this study—VAR-based benchmarks, LSTM architectures, and SHAP—are established in the forecasting and machine learning literature, the contribution of the paper is empirical and design-oriented rather than one of methodological novelty. Specifically, the study provides a strictly harmonized and leakage-controlled forecasting protocol for weekly U.S. REIT returns that aligns data preprocessing, predictor construction, and one-step-ahead evaluation across an econometric benchmark and multiple neural network specifications. To ensure that conclusions are not an artifact of a single benchmark operationalization or a single sample split, the analysis is complemented with a structured robustness battery that extends the econometric comparison within the VAR family, evaluates alternative test start windows, replicates results under an alternative REIT proxy, and assesses sensitivity to random initialization in LSTM training. Finally, the study integrates explainability with an explicitly time-varying perspective by using SHAP to connect predictive drivers to economically interpretable channels and to document how feature contributions evolve across volatility conditions. In this sense, the paper aims to clarify what established forecasting and interpretability tools can and cannot deliver in the particularly low-signal setting of weekly return prediction, while maintaining full transparency and reproducibility.
This study is designed to respond to these limitations. It evaluates TVP–VAR proxy and LSTM models using weekly U.S. REIT returns under a strictly harmonized preprocessing and out-of-sample forecasting design, and it treats the rolling expanding window procedure as an empirically tractable proxy for time variation rather than a definitive TVP–VAR implementation. To avoid conclusions that hinge on a single approximation, the econometric comparison is complemented with additional closely related VAR-based benchmarks estimated under alternative recursive or rolling schemes, all using the same predictor set and evaluation protocol. The analysis also incorporates SHAP to examine how different predictors contribute to the forecasts, both on average and over time. By combining systematic model comparison with interpretability, the study puts forward a forecasting framework that aims to balance predictive performance with transparency, and the conclusions are stated relative to the benchmark set considered rather than as a universal claim of deep learning superiority.
To situate the core design choice, the emphasis is placed on a deliberate comparison between two complementary modeling archetypes under a unified pipeline rather than on enumerating the latest architectures. The TVP–VAR proxy is adopted as a transparent, economically grounded baseline that captures evolving linear dependence in a computationally feasible form, while the LSTM is used as a widely established nonlinear sequential benchmark that can absorb memory effects and interaction structure without turning the analysis into an architecture-engineering exercise. Although more recent families such as attention-based Transformers and temporal convolutional networks have attracted growing interest, their inclusion would expand the scope toward model development and extensive tuning, which would make it harder to attribute performance differences to the intended design contrast. The resulting framing therefore keeps the empirical focus on isolating the incremental value of nonlinear sequential modeling relative to a time-varying econometric proxy, and then interpreting any gains through a consistent, verifiable attribution layer.
2. Materials and Methods
This study develops a harmonized experimental framework to evaluate and explain the forecasting performance of Time-Varying Parameter VAR proxy (TVP–VAR proxy) and Long Short-Term Memory (LSTM) neural networks for weekly REIT return prediction. The methodological pipeline consists of five stages: (i) data preparation and transformation, (ii) model construction, (iii) out-of-sample forecasting, (iv) statistical evaluation, and (v) SHAP-based explainability. All procedures are implemented in Python 3.12 under identical preprocessing, feature sets, and evaluation windows to ensure a fair comparison between the TVP–VAR proxy benchmark and the LSTM specifications.
2.1. Research Framework
The research method is organized into several stages that guide the forecasting exercise from start to finish. It begins with assembling weekly VNQ price data together with a set of macro-financial variables that reflect conditions in equity markets, commodities, interest rates, credit, and market volatility. Once collected, the series are converted into log-returns or yield changes where appropriate, then aligned by date and checked for missing values. The final dataset is split chronologically into training, validation, and test samples to preserve the time structure of the data.
Model development proceeds through two parallel streams. The first employs a TVP–VAR proxy estimated using a rolling expanding window, allowing coefficients to adapt gradually as new information becomes available. The second constructs three LSTM architectures of increasing complexity—LSTM_A, LSTM_Base, and LSTM_B—each trained under identical data conditions to ensure comparability. All models generate one-step-ahead forecasts for the test period.
Performance evaluation uses RMSE, MAE, directional accuracy, and residual diagnostics to capture multiple dimensions of forecasting behavior. Statistical significance is assessed using the Diebold–Mariano test for squared error loss differences and the binomial sign test for directional accuracy. To address interpretability, SHAP values are computed for the LSTM_Base model to quantify global feature importance, nonlinear marginal effects, and time-varying contributions across market regimes.
The complete workflow is summarized in Figure 1, which illustrates the sequential steps for constructing, evaluating, and interpreting the TVP–VAR proxy and LSTM models. This structured design ensures methodological alignment, reproducibility, and a transparent connection between empirical results and economic interpretation.
2.2. Data and Variables
2.2.1. Data Sources
To reflect the broader macro-financial environment, the feature set includes variables spanning equity, rates, volatility, credit, commodities, and international spillovers. U.S. equity market conditions are proxied by the S&P 500 ETF (SPY), which captures broad risk-on/risk-off cycles that often co-move with listed real estate through common pricing factors. Interest rate exposure is represented by the change in the 10-year U.S. Treasury yield (TNX) and by the iShares 7–10 Year Treasury Bond ETF (IEF), reflecting the sensitivity of REIT valuations to discount rate dynamics and bond market conditions. Market uncertainty and time-varying risk sentiment are measured using the CBOE Volatility Index (VIX), which is commonly linked to de-risking episodes that affect real estate equities. Credit and liquidity conditions are proxied by the High-Yield Corporate Bond ETF (HYG), capturing fluctuations in funding risk and risk premia that can transmit to REIT returns (Ozcelebi & Yoon, 2025).
In addition, commodity channels are incorporated through crude oil futures (CL=F) and gold futures (GC=F). Oil serves as a broad macro and inflation-linked input that can influence discount rates and sectoral cash flow expectations, while gold is included as a safe haven proxy that reflects shifts in global risk aversion. Finally, to account for international equity spillovers and global co-movement, we include major regional equity indices: the Hang Seng Index (^HSI), the FTSE 100 (^FTSE), and the Nikkei 225 (^N225). These indices capture global market linkages and cross-regional information transmission that may affect U.S. REIT returns via international portfolio rebalancing, correlated risk premia, and synchronized sentiment. Together, the selected variables provide a parsimonious but comprehensive representation of the macro-financial channels that can shape short-horizon REIT return dynamics. All variable tickers and transformations used in the empirical analysis are listed explicitly here to match the feature indices reported in Table 4.
All series are collected at a weekly frequency over the full sample period 6 January 2015 to 30 December 2024 and are obtained from a single public market-data provider to ensure consistent timestamping and avoid cross-source timing mismatches. The target variable is the weekly return on the U.S. REIT proxy (VNQ), while all predictors are constructed from the corresponding weekly observations of the instruments listed above. Prior to modeling, the raw price or index levels are transformed into stationary inputs—weekly log returns for equity, commodity, and international index series; weekly changes for yield-based measures such as TNX; and level or return transformations for bond ETFs consistent with the forecasting specification—after which all variables are aligned on a common calendar via an inner join so that each observation corresponds to the same market week across the full feature set. This explicit dating and alignment ensures that the subsequent train–validation–test splits are purely chronological and that out-of-sample evaluation is conducted on a clearly defined test window reported in the next subsection.
The predictors are chosen to represent well-established macro-financial channels that theory and prior empirical work associate with REIT pricing and short-horizon return variation. Broad equity conditions (SPY) proxy systematic risk premia and risk-on/risk-off cycles that often co-move with listed real estate through common pricing factors. Interest rate and bond market variables (TNX and IEF) capture the discount rate and term-structure channel that directly affects REIT valuation via the present value of cash flows and financing conditions. Market uncertainty (VIX) reflects time-varying risk aversion and volatility-driven repricing that can trigger deleveraging and return comovement. Credit conditions (HYG) proxy fluctuations in funding liquidity and credit risk premia that transmit to REIT performance through refinancing risk and shifts in broader risk appetite. Commodity prices (CL=F and GC=F) are included to represent inflation-sensitive and safe haven dynamics—oil as a macro and inflation-linked input with implications for rates and expected cash flows, and gold as a flight-to-quality indicator of global risk sentiment. Finally, major international equity indices (^HSI, ^FTSE, ^N225) capture global spillovers and cross-regional information transmission that may affect U.S. REIT returns through correlated sentiment and international portfolio rebalancing.
2.2.2. Data Transformation
To ensure statistical consistency and facilitate meaningful comparison across series, all variables are transformed according to their economic characteristics. Price-based variables such as VNQ, SPY, VIX, and HYG are converted into log-returns. The transformation follows the standard formulation:

$$r_t = \ln(P_t) - \ln(P_{t-1}), \qquad (1)$$

where $P_t$ denotes the closing level of the series in week $t$.
This approach stabilizes variance and induces approximate stationarity, properties that are desirable for time series forecasting models (Hewamalage et al., 2021).
Interest rate variables do not follow a multiplicative structure and therefore require a different transformation. The 10-year Treasury yield is expressed in terms of absolute yield changes, defined as:

$$\Delta y_t = y_t - y_{t-1}, \qquad (2)$$

where $y_t$ is the 10-year Treasury yield observed in week $t$.
After all transformations, the time series are merged using an inner join on the date index to ensure that each observation reflects the same market week across all variables. Any missing or misaligned entries generated during this process are removed through list-wise deletion, resulting in a clean and synchronized dataset for modeling.
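For concreteness, the following sketch illustrates one way to implement the transformation and alignment rules described above, assuming `levels` is a weekly DataFrame of closing levels with one column per instrument; the column names, helper functions, and the yield-series set are illustrative rather than the paper's exact implementation.

```python
# A minimal sketch of the transformation and alignment step, assuming the
# weekly levels have already been assembled into a pandas DataFrame.
import numpy as np
import pandas as pd

YIELD_SERIES = {"TNX"}  # yield-based series use absolute changes

def transform(levels: pd.DataFrame) -> pd.DataFrame:
    out = {}
    for col in levels.columns:
        if col in YIELD_SERIES:
            out[col] = levels[col].diff()          # Delta y_t = y_t - y_{t-1}
        else:
            out[col] = np.log(levels[col]).diff()  # r_t = ln(P_t) - ln(P_{t-1})
    # Combining the Series aligns them on the shared date index; list-wise
    # deletion of rows with any missing entry mimics the inner-join step.
    return pd.DataFrame(out).dropna(how="any")
```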
2.2.3. Dataset Partitioning
After the data have been aligned and transformed, the sample is split in chronological order so that the time structure of the series is preserved and any form of look-ahead bias is avoided. The first 70 percent of the observations is used for model training. The next 15 percent serves as a validation set and is used to guide hyperparameter tuning and early stopping decisions. The remaining 15 percent of the data is kept separate and used only for out-of-sample evaluation.
For the neural network models, additional care is taken to avoid information leakage. All predictor variables used by the LSTM are standardized with a StandardScaler that is fitted only on the training sample. The same scaling parameters are then applied to the validation and test sets, ensuring that no information from future observations enters the training process.
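A compact sketch of the chronological split and train-only scaling follows, with `features` denoting the aligned predictor panel from the previous step (variable names are illustrative):

```python
# Chronological 70/15/15 split; the scaler sees only the training block,
# so no future information leaks into preprocessing.
from sklearn.preprocessing import StandardScaler

def chronological_split(panel, train_frac=0.70, val_frac=0.15):
    n = len(panel)
    i_train = int(n * train_frac)
    i_val = int(n * (train_frac + val_frac))
    return panel.iloc[:i_train], panel.iloc[i_train:i_val], panel.iloc[i_val:]

train, val, test = chronological_split(features)
scaler = StandardScaler().fit(train)              # fitted on training only
train_s, val_s, test_s = map(scaler.transform, (train, val, test))
```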
To make the temporal design fully transparent, we additionally report the exact calendar boundaries implied by this chronological split. Under the main experimental configuration, the out-of-sample test window spans 26 July 2023 to 30 December 2024, while all observations prior to the test start date are used for training and validation under the 70/15/15 allocation. Because the LSTM forecasts are generated using a fixed lookback length, the first available prediction date in each evaluation window can occur after the nominal split point; in such cases, the effective test window begins at the first date for which the required lagged inputs are available, but the split itself remains strictly time ordered. For robustness and to ensure that conclusions do not depend on a single cutoff, alternative test start dates are also considered and summarized in Appendix A Table A2, which reports the corresponding evaluation windows under the same preprocessing, predictor set, and walk-forward protocol.
2.3. TVP–VAR Proxy Model
2.3.1. Model Specification
The TVP–VAR proxy is motivated by the broader TVP–VAR literature, in which the standard VAR framework is extended by allowing coefficients to evolve gradually over time to capture structural change and evolving market relationships (J. Wu & Wang, 2025). Let $y_t$ denote a vector of endogenous variables. The TVP–VAR of order $p$ can be written as:

$$y_t = c_t + \sum_{i=1}^{p} A_{i,t}\, y_{t-i} + \varepsilon_t, \qquad (3)$$

where $c_t$ is a time-varying intercept, $A_{i,t}$ represents the time-varying coefficient matrices, and $\varepsilon_t$ is a vector of innovations assumed to follow a mean-zero process with time-dependent covariance structure. The flexibility of $A_{i,t}$ allows the model to adapt to evolving macro-financial dynamics that static VAR models may fail to capture, especially in periods of heightened volatility or structural transitions. This formulation provides the conceptual basis for time variation, while the empirical benchmark employed in this study operationalizes time variation through a proxy estimation scheme described below.
2.3.2. Estimation Procedure
A fully Bayesian implementation of TVP–VAR typically relies on Markov Chain Monte Carlo (MCMC) methods and is computationally demanding, particularly when applied to high-frequency or multivariate financial data (Ge & Zhang, 2022). To preserve the spirit of coefficient adaptability while ensuring computational feasibility, this study adopts a rolling expanding window estimation strategy as an empirically tractable proxy for time variation rather than a definitive TVP–VAR implementation (X. Gong et al., 2023).
The process begins with selecting the optimal lag order using the Akaike Information Criterion, which balances model fit and parsimony. For each forecast iteration, the VAR(p) model is re-estimated using all available observations up to that point, allowing coefficients to update as new data arrive. After re-estimation, the model produces a one-step-ahead forecast for the REIT return series. This recursive approach approximates adaptive coefficient behavior and can capture gradual shifts in dependence patterns, while remaining substantially more efficient in terms of computation and model complexity (Będowska-Sójka et al., 2024).
To ensure that inference does not hinge on a single operationalization of time variation, the econometric benchmark set is augmented with additional closely related VAR-based baselines estimated under alternative updating schemes. In particular, alongside the rolling expanding proxy, we report results for (i) a fixed-length rolling window VAR and (ii) a recursive VAR benchmark, each evaluated under the same predictor set, lag-selection rule, and strictly time-ordered out-of-sample protocol used for the LSTM models. This design broadens the econometric comparison and supports conclusions that are conditional on a transparent benchmark set rather than on a single approximation choice.
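The expanding-window proxy can be sketched as follows, using the Statsmodels VAR implementation referenced in Section 2.8; the function and variable names, the `max_lags` bound, and the assumption that the VNQ return occupies the first column of `data` are illustrative. A fixed-length rolling variant replaces `data.iloc[:t]` with `data.iloc[t - W:t]`.

```python
# Expanding-window VAR proxy: re-estimate at every step with AIC lag
# selection, then produce a one-step-ahead forecast for the VNQ return.
import numpy as np
from statsmodels.tsa.api import VAR

def expanding_var_forecasts(data, test_start, max_lags=8):
    preds = []
    for t in range(test_start, len(data)):
        window = data.iloc[:t]                   # information set up to t-1
        res = VAR(window).fit(maxlags=max_lags, ic="aic")
        fc = res.forecast(window.values[-res.k_ar:], steps=1)
        preds.append(fc[0, 0])                   # VNQ equation, one step ahead
    return np.asarray(preds)
```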
2.3.3. State-Space TVP–VAR (Kalman Filter) Robustness Estimation
To further address concerns that empirical conclusions may depend on a single approximation of time variation, we additionally implement a state-space TVP–VAR estimated via the Kalman filter, which is the standard computational approach in the time-varying parameter VAR literature. In this formulation, the observation equation corresponds to the VAR representation in (3), while the time-varying coefficients follow a stochastic evolution (state) equation that allows parameters to drift smoothly over time. Specifically, stacking the coefficient matrices into a state vector $\beta_t$, the model can be written in state-space form as:

$$y_t = X_t \beta_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, R), \qquad (4)$$
$$\beta_t = \beta_{t-1} + \eta_t, \qquad \eta_t \sim N(0, Q), \qquad (5)$$

where $X_t = I_n \otimes (1, y_{t-1}', \ldots, y_{t-p}')$, with $I_n$ denoting the $n \times n$ identity matrix and $\otimes$ the Kronecker product. The disturbance terms $\varepsilon_t$ and $\eta_t$ are assumed mutually uncorrelated at all leads and lags. Under (4) and (5), the Kalman filter yields recursive updates of $\beta_t$ as new observations arrive, producing one-step-ahead forecasts that reflect continuously evolving coefficients under a probabilistic updating rule. In implementation, the lag order $p$ is selected using the Akaike Information Criterion, consistent with the main specification. The covariance components $R$ and $Q$ are estimated by maximizing the Gaussian log-likelihood implied by the Kalman filter prediction error decomposition.
Forecasts from the Kalman-filter TVP–VAR are evaluated under the same strictly time-ordered out-of-sample protocol, predictor set, and preprocessing steps used for the rolling-expanding proxy and the LSTM models. This robustness benchmark directly addresses the concern that results might be driven by the expanding window approximation rather than by the underlying econometric model class. To allow time variation while maintaining recursive estimation, the coefficients follow a random walk state evolution, implemented through a fixed discount (forgetting factor) formulation that controls the effective state noise magnitude and produces smooth yet adaptive coefficient paths without ad hoc retuning.
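As a minimal illustration of the recursion, the sketch below filters a single VAR equation with random-walk coefficients under the discount formulation; the full implementation stacks all equations as in (4) and (5), and the discount value and noise scale shown here are illustrative assumptions.

```python
# Kalman filter for one equation y_t = x_t' beta_t + eps_t with random-walk
# states; the forgetting factor `delta` inflates the predicted state
# covariance, playing the role of the state noise Q.
import numpy as np

def kalman_tvp_forecasts(X, y, delta=0.99, r=1.0):
    T, k = X.shape
    beta = np.zeros(k)                 # filtered coefficient vector
    P = np.eye(k)                      # state covariance
    preds = np.empty(T)
    for t in range(T):
        x = X[t]
        P_pred = P / delta             # discounted (inflated) covariance
        preds[t] = x @ beta            # one-step-ahead forecast
        v = y[t] - preds[t]            # prediction error
        S = x @ P_pred @ x + r         # prediction-error variance
        K = P_pred @ x / S             # Kalman gain
        beta = beta + K * v            # coefficient update
        P = P_pred - np.outer(K, x @ P_pred)
    return preds
```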
2.4. LSTM Neural Network Models
2.4.1. Network Architecture
Long Short-Term Memory (LSTM) neural networks are specifically designed to capture nonlinear temporal dependencies by using gated memory mechanisms that regulate information flow across time steps (Chi et al., 2025). This structure enables LSTMs to learn complex, state-dependent dynamics that traditional linear time series models cannot represent, making them particularly suitable for financial forecasting applications characterized by volatility clustering, nonlinear shocks, and regime-dependent behavior (Wei et al., 2025).
To explore the role of network depth in forecasting performance, the study implements three LSTM models with gradually increasing complexity. The simplest specification, LSTM_A, uses a single hidden layer with 32 units and relies on an input window of eight observations. A more flexible baseline model, LSTM_Base, is constructed with two stacked LSTM layers containing 64 and 32 units and a longer window of twelve periods. The most complex variant, LSTM_B, extends this structure further by adding a third hidden layer, resulting in 128, 64, and 32 units, and uses a window size of twenty observations.
In all cases, the LSTM layers are followed by a dense output layer with linear activation to produce one-step-ahead return forecasts (Park & Yang, 2022). By varying both the depth of the network and the length of the input window, this design allows the analysis to examine how additional representational capacity influences out-of-sample predictive accuracy.
2.4.2. Training Procedure
All LSTM models are trained under a unified training protocol to ensure comparability across architectures. Model parameters are optimized by minimizing the Mean Squared Error (MSE), which provides a smooth and convex objective for gradient-based learning. The Adam optimizer is employed with a learning rate of either 0.001 or 0.0005, reflecting common practice in training recurrent neural networks for financial time series prediction. Each model is trained using a batch size of 32, and training convergence is stabilized through early stopping with a patience threshold of twenty epochs. Additionally, a learning-rate reduction on plateau is applied after ten epochs without improvement, allowing the optimizer to refine local minima during later training stages.
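A sketch of the LSTM_Base specification and the training protocol just described, using the TensorFlow/Keras framework named in Section 2.8; the feature count and epoch budget are illustrative assumptions, and the commented `fit` call shows how the pieces connect.

```python
import tensorflow as tf

def build_lstm_base(window=12, n_features=11):
    # Two stacked LSTM layers (64 and 32 units) and a linear output head,
    # matching the LSTM_Base description above.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(patience=10, factor=0.5),
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, batch_size=32, callbacks=callbacks)
```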
To transform the REIT return series and its predictors into a supervised learning problem, the dataset is structured using sliding windows. For a window size $w$, the input sequence and target variable are constructed as:

$$X_t = (r_{t-w+1}, \ldots, r_t), \qquad y_t = r_{t+1},$$

where $X_t$ contains the trailing sequence of returns and $y_t$ represents the next-period return to be predicted. This formulation aligns the LSTM with the one-step-ahead forecasting objective used throughout the study.
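In code, the window construction amounts to the following sketch, where the target is assumed to be the first column of the scaled array:

```python
import numpy as np

def make_windows(series: np.ndarray, w: int):
    # series: (T, n_features) scaled observations. X_t holds the trailing w
    # rows and y_t is the next-period VNQ return (column 0), so each pair is
    # strictly one-step-ahead.
    X, y = [], []
    for t in range(w, len(series)):
        X.append(series[t - w:t])
        y.append(series[t, 0])
    return np.asarray(X), np.asarray(y)
```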
Although all three LSTM configurations are evaluated for forecasting performance, only the baseline model is subjected to SHAP explainability analysis, as it achieves the best trade-off between predictive accuracy and architectural parsimony (Qian et al., 2025).
2.5. Forecasting and Model Evaluation
2.5.1. One-Step-Ahead Forecasting
Both the TVP–VAR proxy and LSTM models are evaluated under a strictly constrained one-step-ahead forecasting design, which ensures that predictions rely solely on information available at the corresponding time point, thereby preventing any form of look-ahead bias. The LSTM models generate forecasts using fixed parameters learned exclusively from the training and validation sets, reflecting a standard out-of-sample forecasting protocol for neural networks (Q. Wu et al., 2023). In contrast, the TVP–VAR proxy model updates its parameters recursively; at each forecasting step, the VAR(p) system is re-estimated using all observations available up to that date, allowing the framework to adapt gradually to evolving data-generating processes in a manner consistent with time-varying parameter modeling (J. Wu & Wang, 2025). This design ensures a fair and methodologically aligned comparison between a static-weight deep learning model and an adaptive econometric benchmark.
2.5.2. Accuracy Metrics
Model evaluation relies on three complementary performance measures—Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Directional Accuracy (DA)—each capturing a distinct dimension of forecasting behavior. RMSE penalizes large errors more heavily and is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)^2}.$$

MAE provides a more robust measure of average deviation, expressed as:

$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|y_t - \hat{y}_t\right|.$$

Directional Accuracy evaluates whether the model correctly predicts the sign of the return, a metric particularly relevant for trading and risk management applications:

$$\mathrm{DA} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\left\{\operatorname{sign}(\hat{y}_t) = \operatorname{sign}(y_t)\right\}.$$
By combining magnitude-based and direction-based metrics, the evaluation framework captures both statistical precision and decision-oriented performance, which is essential in financial forecasting contexts.
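The three measures reduce to a few lines of NumPy, as in the following sketch:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def directional_accuracy(y, yhat):
    # Share of weeks in which the predicted and realized signs agree.
    return float(np.mean(np.sign(y) == np.sign(yhat)))
```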
2.5.3. Proposed Forecasting Architecture
To ensure methodological consistency and enable transparent comparison across modeling approaches, this study proposes a unified forecasting architecture that integrates data preprocessing, window-based feature construction, model-specific forecasting logic, and unified evaluation procedures. The workflow is designed to preserve the temporal structure of the data while maintaining comparability between the adaptive TVP–VAR proxy and static-parameter LSTM models.
The following diagram (Figure 2) summarizes the end-to-end architecture:
The proposed architecture begins with collecting raw market and macro-financial data and proceeds through cleaning, alignment, and transformation stages that convert all series into compatible return or yield-change formats. Sliding windows are then constructed to produce supervised learning sequences suitable for LSTM training while simultaneously generating the lagged structure required for VAR-based modeling.
The architecture then diverges into two parallel forecasting paths. The LSTM path uses fixed weights learned during training and validation, reflecting the standard practice in deep learning where models do not update parameters during testing. Conversely, the TVP–VAR proxy path re-estimates the VAR(p) model at every forecast step, allowing its coefficients to evolve with the data, consistent with the principles of time-varying parameter modeling (Jiménez et al., 2023; J. Wu & Wang, 2025).
Both paths ultimately generate one-step-ahead predictions, which feed into a unified evaluation module that computes RMSE, MAE, and Directional Accuracy, as well as the Diebold–Mariano test for statistical comparison. This unified architecture ensures that differences in forecasting performance can be attributed to model structure rather than discrepancies in data preparation or evaluation methodology.
2.5.4. Hyperparameter Selection and Data-Snooping Controls
To mitigate data-snooping concerns and ensure that model design choices are not informed by test set outcomes, all LSTM hyperparameters are selected using only the training and validation samples, with the test set reserved exclusively for final out-of-sample evaluation. Candidate LSTM configurations are prespecified and compared under the same one-step-ahead forecasting protocol used in the main experiments, and selection is based on minimizing validation RMSE. The search space includes architectural depth (one versus two LSTM layers), hidden dimension size, dropout regularization, learning rate, batch size, and lookback length used to construct supervised sequences. Model training employs early stopping monitored on the validation loss to reduce overfitting risk, and the final LSTM_Base specification corresponds to the configuration with the best validation performance. Importantly, no hyperparameter tuning or informal adjustment is performed using test results; once the LSTM_Base configuration is selected, it is held fixed for all reported comparisons and robustness checks to ensure that test set metrics reflect genuine out-of-sample performance rather than iterative tuning.
2.6. Statistical Significance Testing
2.6.1. Diebold–Mariano Test
Statistical comparison of competing forecasting models requires an assessment of whether observed differences in predictive accuracy are statistically meaningful rather than the result of random variation. To this end, the Diebold–Mariano (DM) test is employed to evaluate whether two models exhibit equal expected forecast loss over the evaluation horizon (Jiang et al., 2022). For two competing forecasts with errors $e_{1,t}$ and $e_{2,t}$, the DM loss differential series is defined as:

$$d_t = L(e_{1,t}) - L(e_{2,t}),$$

where the squared error loss function is specified as:

$$L(e_{i,t}) = e_{i,t}^2.$$

Under the null hypothesis of equal predictive accuracy, the expected value of $d_t$ is zero. The DM statistic adjusts for serial correlation in the loss differential—an important consideration in time series forecasting—and provides a robust framework for determining whether one model significantly outperforms another in terms of squared predictive error.
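For reference, a minimal sketch of the DM statistic under squared error loss with a Bartlett-weighted HAC variance follows; for the one-step-ahead design used here, $h = 1$ and the correction reduces to the sample variance of $d_t$.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2, h=1):
    # Loss differential d_t under squared error loss.
    d = e1 ** 2 - e2 ** 2
    T, d_bar = len(d), d.mean()
    # HAC variance of the mean with Bartlett weights, truncation lag h - 1.
    var = np.mean((d - d_bar) ** 2)
    for k in range(1, h):
        cov = np.mean((d[k:] - d_bar) * (d[:-k] - d_bar))
        var += 2 * (1 - k / h) * cov
    dm = d_bar / np.sqrt(var / T)
    pval = 2 * (1 - stats.norm.cdf(abs(dm)))   # two-sided p-value
    return dm, pval
```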
2.6.2. Binomial Sign Test for Directional Accuracy
While magnitude-based metrics such as RMSE and MAE capture statistical precision, directional accuracy (DA) provides insight into the model’s ability to correctly anticipate the sign of returns, which is often more relevant in trading and allocation contexts (Greer, 2003). To examine whether a model’s directional predictions exceed random guessing, a binomial sign test is applied. The null and alternative hypotheses are specified as:

$$H_0: p = 0.5 \qquad \text{versus} \qquad H_1: p > 0.5,$$

where $p$ denotes the probability that the predicted sign matches the actual sign of the return. Rejecting the null hypothesis implies that the forecasting model delivers directionally informative signals beyond chance, providing practical relevance for applications in timing, hedging, and risk-sensitive decision-making (Greer, 2003).
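Equivalently, the test can be run with SciPy's exact binomial test, as in this sketch:

```python
import numpy as np
from scipy.stats import binomtest

def directional_sign_test(y, yhat):
    hits = int(np.sum(np.sign(y) == np.sign(yhat)))
    # One-sided test of H0: p = 0.5 against H1: p > 0.5.
    return binomtest(hits, n=len(y), p=0.5, alternative="greater").pvalue
```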
2.7. SHAP Explainability Analysis
2.7.1. SHAP Framework
To address the interpretability limitations commonly associated with deep learning models, this study employs SHapley Additive exPlanations (SHAP), a unified framework grounded in cooperative game theory that attributes the output of a model to the marginal contributions of each input feature (Hussain et al., 2021; Tan et al., 2023). For the LSTM model’s one-step-ahead forecast $\hat{y}_t$, the SHAP decomposition expresses the prediction as the sum of a baseline expectation and a set of feature-specific contribution values:

$$\hat{y}_t = \phi_0 + \sum_{i=1}^{M} \phi_{i,t},$$

where $\phi_{i,t}$ denotes the Shapley value representing the contribution of feature $i$ to the prediction at time $t$, and $\phi_0$ represents the average model output over a reference distribution. This additive formulation ensures consistency and local accuracy, enabling SHAP to provide interpretable and theoretically principled explanations for complex nonlinear forecasting models (C. Gong et al., 2024).
2.7.2. Implementation
SHAP analysis focuses on the baseline LSTM model, as this specification provides a reasonable balance between forecasting accuracy and model complexity. To compute Shapley values, a subset of the training data is used as background samples, representing typical market conditions observed by the model during learning. The explainer is selected based on network compatibility, using DeepExplainer when feasible and GradientExplainer otherwise, so that feature contributions can be estimated without modifying the model structure.
Several types of SHAP outputs are examined. These include overall importance rankings to identify dominant predictors, beeswarm plots to show the distribution of feature effects, dependence plots to illustrate nonlinear relationships, and time-based SHAP paths that track how predictor influence changes over time. Together, these results help clarify not only which variables drive LSTM forecasts, but also how their roles vary across different market environments. This interpretability layer complements the forecasting results by linking model predictions to observable economic behavior.
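A sketch of the explainer-selection logic follows, where `model` and the scaled sequence arrays are assumed to come from the earlier training step and the background size of 100 is an illustrative choice:

```python
import numpy as np
import shap

rng = np.random.default_rng(0)
idx = rng.choice(len(X_train_seq), size=100, replace=False)
background = X_train_seq[idx]          # typical training-period conditions

try:
    explainer = shap.DeepExplainer(model, background)
    shap_values = explainer.shap_values(X_test_seq)
except Exception:
    # Fall back when the network graph is incompatible with DeepExplainer.
    explainer = shap.GradientExplainer(model, background)
    shap_values = explainer.shap_values(X_test_seq)

# Aggregating |SHAP| over time steps yields global feature-importance rankings.
```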
2.8. Computational Environment
All computational procedures, including data preprocessing, model estimation, forecasting, and explainability analysis, were conducted using Python 3.12. The LSTM models were implemented with the TensorFlow/Keras framework. VAR-family benchmarks, including the TVP–VAR proxy constructed via recursive expanding window VAR estimation, were implemented using Statsmodels econometric routines. The state-space TVP–VAR robustness benchmark was implemented via a Kalman filter likelihood-based state-space procedure, consistent with the specification in Section 2.3.3. SHAP-based interpretability was performed using the SHAP library, and Streamlit was employed to structure a reproducible experimental pipeline that integrates data loading, model execution, visualization, and result exporting. All figures and tables presented in this manuscript were generated directly from this unified workflow, and the code and associated outputs are made available in the project repository (see the Data Availability statement), without relying on a separate Supplementary Materials file.
2.9. Use of Generative AI (GenAI)
Generative AI tools (specifically ChatGPT version 5.3) were used solely for tasks related to linguistic refinement, such as improving clarity, enhancing academic tone, structuring narrative coherence, and drafting preliminary section outlines based on numerical results provided by the authors. Importantly, GenAI was not used to generate, manipulate, or analyze data, nor to produce any computational results associated with the forecasting experiments, statistical tests, or SHAP analyses. All quantitative work was performed independently by the authors using the computational framework described above, ensuring the scientific integrity and reproducibility of the study.
2.10. Availability of Data, Code, and Materials
All datasets used in this study—including raw market data, processed time series, constructed feature sets, trained model outputs, forecast results, evaluation metrics, and SHAP interpretability artifacts—will be made publicly available upon acceptance of the manuscript. A dedicated online repository will host the full Python codebase encompassing data preparation scripts, model implementations for the TVP–VAR proxy benchmark (and the state-space TVP–VAR (Kalman filter) robustness specification) as well as the LSTM architectures, forecasting routines, statistical significance tests, and visualization modules. No proprietary, confidential, or restricted data sources were used. All materials necessary for full replication of the empirical results will therefore be openly accessible, supporting transparency and facilitating further research in REIT forecasting and explainable financial modeling.
4. Discussion
This section interprets the empirical results by linking forecasting performance, statistical significance, and SHAP-based interpretability to broader economic mechanisms underlying REIT market behavior. The aim is not only to identify which model performs best but to explain why performance differences arise and what they imply for forecasting practice, risk assessment, and market understanding.
4.1. Why LSTM Outperforms the TVP–VAR Proxy
The forecasting results show that the LSTM_Base model performs better than the TVP–VAR proxy across all reported accuracy measures (Table 1). While the reductions in RMSE and MAE are relatively small and only moderately significant (Table 2), the pattern is consistent. This outcome can be traced to differences in how the two approaches represent return dynamics. The TVP–VAR proxy relies on a linear structure with coefficients that evolve smoothly over time. Although this allows for gradual structural change, it restricts the model’s ability to reflect nonlinear interactions, threshold effects, and regime-dependent behavior that frequently characterize financial markets (Canöz & Kalkavan, 2024). In the context of REITs, returns are often affected by volatility clustering and sudden adjustments in interest rates and risk sentiment, which remain difficult to capture within a linear framework (Salisu et al., 2024).
LSTM models approach the problem differently. Through their gated memory structure, they can learn nonlinear and state-dependent relationships directly from the data and adjust as conditions change over time (Masum et al., 2022). This flexibility helps explain their stronger performance around turning points and short-lived market disruptions, as illustrated in Figure 4 and Figure 5. The residual patterns provide further support: LSTM_Base shows a tighter error distribution (Figure 7) and lower rolling MAE during volatile episodes (Figure 8). Comparing the neural architectures, the baseline configuration strikes the most effective balance. The shallow LSTM_A lacks sufficient flexibility, while the deeper LSTM_B exhibits signs of overfitting, making LSTM_Base the most stable and reliable specification.
4.2. Economic Significance of Directional Accuracy
Directional accuracy offers an additional lens that is closely related to trading and short-term allocation decisions. In this study, LSTM_Base records the highest directional accuracy at 0.5318, but the improvement cannot be distinguished statistically from random prediction (Table 3). This result is not unexpected. Weekly REIT returns are characterized by a weak signal relative to noise, which makes predicting the direction of returns more challenging than reducing forecast errors. Improvements in squared error measures therefore do not necessarily imply better sign predictions, particularly when returns fluctuate symmetrically around zero (Guzzetti, 2020). In addition, the limited length of the evaluation sample reduces the power of binomial significance tests.
However, the economic value test in Section 3.3.4 shows that under a transparent long–cash rule with 10 bps costs, net performance is negative, indicating that the observed directional metrics do not translate into exploitable profitability under the trading assumptions examined in this study. While modest directional edges may prove economically relevant under alternative portfolio constructions, position-sizing rules, and risk controls (e.g., volatility scaling or more sophisticated overlays), such gains are not supported by the present backtest evidence and therefore remain strategy-dependent rather than implied by directional accuracy alone.
4.3. SHAP Insights: Global Linkages and Nonlinear Drivers
The SHAP results offer a clearer view of how the LSTM model uses available information. As shown in Table 4 and Figure 9, past VNQ returns contribute the most to the forecasts, suggesting that short-term persistence remains an important feature of REIT pricing. This pattern is consistent with gradual adjustment in REIT markets, where prices do not immediately reflect new information.
Several external variables also play a visible role. The Hang Seng Index (^HSI) and crude oil prices (CL=F) frequently appear among the strongest contributors, indicating that REIT returns respond to broader global conditions rather than domestic factors alone. The dependence plots (Figure 10, Figure 11 and Figure 12) show that these effects are not constant. In particular, the influence of ^HSI becomes more pronounced during periods of stronger performance in Asian equity markets.
Oil price movements also affect the forecasts, which is consistent with their link to inflation expectations and macroeconomic cycles (Luo et al., 2024). Other variables, including the FTSE 100, Nikkei 225, gold, and credit market indicators, contribute more modestly but reinforce the presence of global and cross-asset linkages (Ozcelebi & Yoon, 2025). Interest rate measures such as TNX and IEF remain relevant, although their overall impact is smaller, in line with evidence that short-horizon REIT dynamics are increasingly shaped by international financial conditions (M. C. Wu & Wang, 2024).
The prominence of the Hang Seng Index (^HSI) in the SHAP attribution is economically consistent with the view that Hong Kong acts as a key conduit in global risk transmission between Asian and U.S. markets, especially under time-varying connectedness and volatility spillovers. Recent evidence documents that cross-market spillovers between the U.S., Hong Kong, and Mainland China are dynamic and that Hong Kong can play an intermediary role in transmitting shocks across regions, which supports interpreting ^HSI as an informative proxy for global risk conditions rather than merely a local equity factor. In the same spirit, the recurring importance of crude oil (CL=F) is consistent with the macro-financial transmission channel in which oil price shocks affect inflation expectations and discount rate dynamics, which can propagate to rate-sensitive assets such as securitized real estate. The literature on real estate and securitized real estate markets increasingly emphasizes that oil shocks matter not only through growth expectations but also through inflation and interest rate adjustments, thereby influencing REIT valuation and short-horizon return variation. Together, these channels provide a verifiable interpretation for why ^HSI and CL=F emerge as influential predictors in an LSTM setting: they proxy time-varying global risk and macro-inflation news that can shift common risk premia and discount rates, even when directional predictability remains weak.
4.4. Regime-Dependent Sensitivities
The time-varying SHAP patterns (Figure 12, Figure 13 and Figure 14) show that predictor importance shifts across market regimes. During volatility spikes, the influence of lagged VNQ, SPY, and TNX increases sharply, reflecting heightened sensitivity of REIT pricing to domestic equity sentiment and interest rate movements when market stress intensifies (Tadle, 2022). In calmer periods, global equity indices and commodity variables become more influential, suggesting stronger cross-market transmission when uncertainty is lower.
VIX exhibits relatively stable SHAP magnitudes across regimes, indicating that while volatility remains relevant, it does not dominate REIT dynamics compared with other macro-financial drivers. These regime-dependent patterns highlight the value of nonlinear models combined with SHAP interpretability for detecting evolving dependencies that static or slowly drifting linear models may obscure.
4.5. Implications for Forecasting, Portfolio Management, and Market Understanding
These results have several implications for forecasting and applied financial analysis. First, the LSTM specifications exhibit more stable error-based performance than the econometric benchmark set under a harmonized and strictly out-of-sample design, suggesting that nonlinear sequence models can offer incremental robustness when REIT–macro relationships are unstable. Second, the SHAP-based analysis improves practical usability by making it clearer how macro-financial variables contribute to forecast movements and how these contributions shift across market conditions, which is particularly relevant for risk management settings where transparency and accountability are central. From a portfolio perspective, the prominence of global equity indices and commodity-related variables implies that REIT exposures are better interpreted within a broader, internationally connected asset framework rather than only against domestic equity benchmarks. Overall, the interpretability layer complements the forecasting exercise by linking predictive patterns to economically plausible channels and by providing a structured way to communicate model behavior in applied decision contexts.
4.6. Limitations and Practical Scope
The modest visual tracking of realized weekly returns in the forecast plots should be interpreted in the context of the low signal-to-noise ratio that characterizes short-horizon return series. In such settings, even well-specified models often struggle to reproduce high-frequency fluctuations in the realized path, and performance differences are more appropriately assessed using strictly time-ordered out-of-sample metrics and robustness evidence rather than visual alignment alone. Consistent with this perspective, the results indicate that LSTM_Base yields small but persistent improvements in RMSE and MAE relative to the VAR-based benchmark set, while directional accuracy remains broadly comparable and is not consistently distinguishable from random prediction. Moreover, when forecasts are translated into a transparent long–cash allocation rule under realistic transaction costs, net-of-cost performance is weak, highlighting that incremental statistical gains do not automatically imply economically exploitable profitability in this sample. Accordingly, the practical scope of the present findings is best framed as incremental error reduction and improved interpretability of dependence channels under time-varying conditions, rather than as evidence of reliable trading profitability.
5. Conclusions
This paper evaluates whether LSTM neural networks can improve the forecasting of weekly U.S. REIT returns relative to a TVP–VAR proxy benchmark, under a fully harmonized experimental design that holds constant data preprocessing, feature construction, and the strictly time-ordered out-of-sample evaluation protocol. Within this controlled setting, the baseline LSTM configuration produces the most stable forecasts. The gains over the econometric benchmark are modest, but they appear consistently in error-based measures such as RMSE and MAE. In contrast, evidence for directional predictability is weaker: directional accuracy differences are small and not consistently distinguishable from benchmark performance at conventional significance levels, so directional results are interpreted as complementary diagnostics rather than as the primary basis for model superiority.
Beyond forecasting accuracy, the study emphasizes interpretability as a practical requirement in financial applications. The SHAP analysis provides a transparent decomposition of the LSTM forecasts and links the model’s behavior to economically meaningful channels. Recent REIT returns emerge as the dominant driver, consistent with short-horizon persistence and partial adjustment in listed real estate pricing. Global equity indicators—most notably the Hang Seng Index—also rank highly, together with crude oil, highlighting the role of cross-market spillovers and macro-commodity sensitivity in shaping U.S. REIT returns. Importantly, the influence of these predictors is not constant over time: their contributions vary across volatility regimes, consistent with changing dependence structures and shifts in risk transmission as market conditions move from calm to turbulent states.
The analysis remains subject to several limitations. The benchmark comparison is necessarily conditional on the model set considered and on the available evaluation window, and the predictive signal in weekly REIT returns is inherently weak. For this reason, the conclusions are stated cautiously and are supported by robustness exercises designed to reduce reliance on specific implementation choices. In particular, additional VAR-family benchmarks mitigate dependence on a single operationalization of time variation, an alternative REIT proxy (IYR) shows that the main error-based conclusions are not proxy-specific, and a seed-based sensitivity analysis indicates that the LSTM results are not driven by favorable random initialization. Future work could extend the framework to sector-level REIT indices, longer samples, richer information sets, and alternative nonlinear architectures, while maintaining the same emphasis on transparent evaluation and explainable forecasting.
In line with the empirical evidence, the contribution of the study is positioned as a transparent and reproducible framework for comparative forecasting and feature attribution analysis in REIT returns. The incremental benefits of nonlinear sequential modeling are best characterized as modest reductions in forecast error that are stable across robustness designs, while directional predictability remains weak; accordingly, results are interpreted relative to the benchmark set considered rather than as a universal claim of deep learning superiority or universally actionable sign-timing signals.