Learning Forecast-Efficient Yield Curve Factor Decompositions with Neural Networks

Abstract: Most factor-based forecasting models for the term structure of interest rates depend on a fixed number of factor loading functions that have to be specified in advance. In this study, we relax this assumption by building a yield curve forecasting model that learns new factor decompositions directly from data for an arbitrary number of factors, combining a Gaussian linear state-space model with a neural network that generates smooth yield curve factor loadings. In order to control the model complexity, we define prior distributions with a shrinkage effect over the model parameters, and we show how to obtain computationally efficient maximum a posteriori numerical estimates using the Kalman filter and automatic differentiation. An evaluation of the model's performance on 14 years of historical data of the Brazilian yield curve shows that the proposed technique was able to obtain better overall out-of-sample forecasts than traditional approaches, such as the dynamic Nelson and Siegel model and its extensions.


Introduction
Yield curve forecasting is an important instrument used by treasuries, central banks, and market participants in a wide range of applications, such as financial asset pricing and bond portfolio management. Yield curve forecasting models have increased their predictive abilities since the work of Diebold and Li (2006), who proposed the use of statistical models focused on out-of-sample forecasting performance, in contrast to previous works based on no-arbitrage (Hull and White 1990; Heath et al. 1992) and equilibrium approaches (Vasicek 1977; Cox et al. 2005; Duffie and Kan 1996).
A major part of the predictive power of the Diebold and Li (2006) model is attributed to the advantageous properties of the Nelson and Siegel (1987) decomposition, which is able to describe a variety of yield curve shapes with a small number of factors. Following a two-step estimation approach, Diebold and Li (2006) use the Nelson and Siegel model to decompose the yield curves independently into common factors, which are then used as in the traditional multivariate time-series setting.
Recent works show that independent yield curve decomposition models, which can be adapted to a forecast setting using the two-step estimation approach, can be further improved to consider more complex yield curve shapes. For example, Faria and Almeida (2018) suggest a hybrid model with a parametric component for the longer end of the curve and a B-spline model for the shorter end, Takada and Stern (2015) use a method of yield curve factor decomposition based on non-negative matrix factorization algorithms, and Mineo et al. (2020) propose an arbitrage-free decomposition based on dynamic constrained B-splines, which is then used to forecast the yield curve using univariate time-series models.
In a similar vein, other authors suggest using more powerful yield curve decompositions in a state-space modeling framework, which allows estimating the entire model in one step. For instance, Bowsher and Meeks (2008) propose the functional signal plus noise model, which generates the yield curve using a single natural cubic spline with k knots constructed to interpolate a set of "latent" yield curve coordinates, which are modeled as a VAR(1) process. Despite performing dimensionality reduction, the functional signal plus noise model does not "disentangle" the yield curve into latent factors, as in the Nelson and Siegel decomposition, since the latent coordinates are restricted to points close to the observed yield curve. Alternatively, Hays et al. (2012) employ natural cubic spline functions for each factor loading in a linear state-space model with AR(1) dynamics. The factor loading functions are estimated sequentially using all of the selected maturities in the dataset as knots.
In this paper, we propose augmenting the Gaussian linear state-space model structure of yield curve factor models with a neural network that generates parameterized smooth factor loading functions, which can be jointly estimated with the transition parameters in one step. To help ensure that the generated yield curve factor loadings have good properties (such as smoothness) and to avoid the problem of overfitting in the temporal component of the model, we specify prior distributions for the model parameters that reflect these premises.
Among all of the works reviewed, our approach bears the greatest resemblance to the model proposed by Hays et al. (2012), although we highlight some key differences. Natural cubic splines depend on the number and position of knots, which need to be set beforehand, whereas neural networks do not have such a requirement. In practice, selecting the number and position of knots can be a difficult task, which can significantly impact the resulting curve. Secondly, due to computational constraints, Hays et al. (2012) estimate each factor loading function sequentially. In this work, we remove this restriction and estimate all factor loading functions simultaneously. Finally, we also investigate the impact of prior distributions with a regularization effect on both state and space parameters of yield curve forecasting models.
In an empirical evaluation conducted on 14 years of Brazilian yield curve data, we find that the proposed technique achieves better out-of-sample forecasting performance than traditional methods while still maintaining the advantageous properties of classic yield curve models. Given its similarity to long-established factor-based models, we highlight that the proposed technique can also be adapted to include a wide range of extensions proposed in the yield curve forecasting literature in the last decades, which may constitute an interesting topic for future studies.
The rest of this paper is organized as follows. We briefly review the dynamic Nelson and Siegel model family and parameter estimation approaches in Section 2. In Section 3, we introduce the neural network augmented state-space model. In Sections 4 and 5, we evaluate the proposed model's performance and present our concluding remarks.

The Dynamic Nelson and Siegel Model
In their seminal work, Nelson and Siegel (1987) propose a parsimonious three-factor yield curve model that can represent a variety of common yield curve shapes with precision. The Nelson and Siegel model (written using the Diebold and Li (2006) parameterization) for the yield curve y evaluated at maturity m is given by

y(m) = β_0 + β_1 (1 − e^{−λm})/(λm) + β_2 [(1 − e^{−λm})/(λm) − e^{−λm}] + ε(m),

where m is the yield maturity; β_0, β_1, and β_2 are the yield curve factors; λ is an exponential decay parameter; and ε(m) is an uncorrelated error term. By individually estimating the parameters of the Nelson and Siegel model, any given yield curve y can be succinctly represented in a four-dimensional space constructed from the obtained yield curve factors and the exponential decay term, effectively functioning as a dimensionality reduction technique.
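As a concrete illustration, the Nelson and Siegel curve for a fixed decay parameter λ can be evaluated as follows (a minimal sketch of the noise-free part of the decomposition; the function name is illustrative):

```python
import numpy as np

def nelson_siegel(m, beta0, beta1, beta2, lam):
    """Evaluate the Nelson-Siegel yield curve at maturity m (noise-free part)."""
    m = np.asarray(m, dtype=float)
    slope = (1 - np.exp(-lam * m)) / (lam * m)   # short-term factor loading
    curvature = slope - np.exp(-lam * m)         # medium-term factor loading
    return beta0 + beta1 * slope + beta2 * curvature
```

The limiting behavior described below is easy to check numerically: for very large m the curve approaches β_0, and for m near zero it approaches β_0 + β_1.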
An advantage of the Nelson and Siegel yield curve formulation is the existence of economic interpretations of the model factors. The intercept parameter β_0 can be regarded as a long-term factor, given that y(m) asymptotically approaches β_0 as m increases. The factor β_1 can be interpreted as a short-term factor, since its loading function (1 − e^{−λm})/(λm) is strictly decreasing, starting at 1 and decaying toward 0. Lastly, the third factor is interpreted as a medium-term factor, since its loading function (1 − e^{−λm})/(λm) − e^{−λm} starts at 0, reaches a positive peak, and then decays toward 0. The parameter λ affects the rate of decay of both the short- and medium-term factor loadings as well as the location of the peak in the medium-term loading. Nelson and Siegel (1987) recommend treating the exponential decay parameter λ as fixed, allowing the factors {β_0, β_1, β_2} to be estimated through ordinary least squares (OLS), which significantly simplifies the estimation procedure.
In order to adapt the Nelson and Siegel decomposition of the term structure to the forecasting setting, Diebold and Li (2006) model the evolution of the latent yield curve factors as a first-order autoregressive process:

β_t = F β_{t−1} + η_t,   η_t ∼ N(0, P),    (3)

where β_t is the vector of factors at time t, F and P are 3 × 3 matrices, and P is positive semi-definite. For a fixed set of maturities m = [m_1, m_2, ..., m_M]^T, the measurement equation can be written as

y_t = H_λ β_t + ε_t,   ε_t ∼ N(0, Q),    (4)

where y_t = [y_t(m_1), ..., y_t(m_M)]^T, Q is a diagonal positive semi-definite matrix, and the i-th row of the factor loading matrix H_λ is

[1,  (1 − e^{−λm_i})/(λm_i),  (1 − e^{−λm_i})/(λm_i) − e^{−λm_i}].

The estimation method of Diebold and Li (2006), often referred to as the two-step dynamic Nelson and Siegel model, involves first extracting the Nelson and Siegel factors independently for each yield curve and then fitting the autoregressive model to the factors extracted in the first step. In order to increase the numerical stability and computational efficiency of the first step, the exponential decay parameter λ is treated as fixed, allowing the estimation of the Nelson and Siegel factors through ordinary least squares (OLS), as noted before. Diebold and Li (2006) further suggest fixing λ at the value that maximizes the medium-term loading function (1 − e^{−λm})/(λm) − e^{−λm} when evaluated at the average sample maturity (1/M) ∑_{i=1}^{M} m_i. Alternatively, a grid search procedure can be used to select λ, as suggested by Nelson and Siegel (1987).
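The two-step procedure can be sketched as follows, assuming a fixed decay parameter λ (helper names are hypothetical): step 1 extracts the factors for every curve by OLS, and step 2 fits an unrestricted VAR(1) to the extracted factors by least squares.

```python
import numpy as np

def ns_loading_matrix(maturities, lam):
    """Nelson-Siegel factor loading matrix H_lam for a fixed maturity grid."""
    m = np.asarray(maturities, dtype=float)
    slope = (1 - np.exp(-lam * m)) / (lam * m)
    return np.column_stack([np.ones_like(m), slope, slope - np.exp(-lam * m)])

def extract_factors(Y, maturities, lam):
    """Step 1: OLS factor extraction for each observed curve (rows of Y)."""
    H = ns_loading_matrix(maturities, lam)
    # beta_t = argmin ||y_t - H beta||^2 for every t, solved jointly
    return np.linalg.lstsq(H, Y.T, rcond=None)[0].T

def fit_var1(B):
    """Step 2: least-squares VAR(1) fit, B_t ~ c + F B_{t-1}."""
    X = np.column_stack([np.ones(len(B) - 1), B[:-1]])
    coef = np.linalg.lstsq(X, B[1:], rcond=None)[0]
    return coef[0], coef[1:].T   # intercept c, transition matrix F
```

Forecasts then evolve the last extracted factor vector through the fitted VAR and map it back through H_λ, as described in the evaluation section.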
More generally, the dynamic Nelson and Siegel model is a particular case of the Gaussian linear state-space model (often also referred to in this context as the dynamic Gaussian factor model), with measurement Equation (4) and state transition Equation (3). Therefore, the parameters of the state transition and measurement equations can be jointly estimated using the prediction error decomposition to compute the model likelihood. This general estimation procedure is described in more detail in the next section.

The Dynamic Nelson, Siegel, and Svensson Model
In order to increase the model flexibility, Svensson (1994) includes a fourth factor in the Nelson and Siegel decomposition:

y(m) = β_0 + β_1 (1 − e^{−λ_1 m})/(λ_1 m) + β_2 [(1 − e^{−λ_1 m})/(λ_1 m) − e^{−λ_1 m}] + β_3 [(1 − e^{−λ_2 m})/(λ_2 m) − e^{−λ_2 m}] + ε(m),

where β_0, β_1, β_2, λ_1, and ε(m) are defined as before, and β_3 is a new factor with its own exponential decay parameter λ_2. Svensson (1994) notes that the inclusion of another medium-term factor (β_3) with a second exponential decay parameter λ_2 significantly increases the flexibility of the model to fit more yield curve shapes. Due to its direct similarities to the Nelson and Siegel model, a dynamic version of the Svensson model can be easily constructed following the same steps described in the previous section.

Model Definition
A natural extension of the discussed modeling approaches is to try to learn the interest rate term structure decomposition directly from data. For a model with k factors, an intuitive approach is to estimate the entire measurement matrix H with M × k free parameters simultaneously with the rest of the Gaussian linear state-space model parameters. However, this approach suffers from two main disadvantages: forecasts cannot be computed for unobserved maturities without requiring an interpolation or extrapolation procedure, and, most importantly, the model does not induce similarity among interest rates with similar maturities, which is a fundamental property of yield curves. We refer to the latter as the stability property (or smoothness) of the factor loadings.
To tackle both of these issues while still preserving model flexibility, we propose the following yield curve factor model for k arbitrary factors:

y_t(m) = g_θ(m)^T β_t + ε_t(m),

where m is the yield maturity, β_t is the vector of the k yield curve factors at time t, and g_θ : ℝ_+ → ℝ^k is a feedforward neural network (with parameter vector θ) that maps a yield maturity m to the corresponding vector of factor loadings:

g_θ(m) = [g_θ^{(1)}(m), ..., g_θ^{(k)}(m)]^T.

The structure of the neural network g_θ guarantees continuous and differentiable factor loadings, allowing the evaluation of the yield curve at any given maturity, as in the dynamic Nelson and Siegel model. Continuous factor decompositions can be learned from data by estimating the parameter vector θ of the neural network g_θ, but since neural networks are powerful general function approximators, we highlight the importance of restricting the network weights in order to produce stable factor loadings. A restriction in the form of a prior distribution over the neural network weights is described in the next section.
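A minimal NumPy sketch of such a loading network and the noise-free measurement map is given below. The layer sizes and initialization here are illustrative placeholders, not the architecture used in the experiments, and the class and function names are hypothetical; the paper's implementation uses PyTorch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LoadingNet:
    """Feedforward net g_theta: maturity m -> k factor loadings in [0, 1].
    Two tanh hidden layers and a sigmoid output, mirroring the general
    design described in the paper (sizes here are illustrative)."""

    def __init__(self, k, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (1, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.5, (hidden, hidden)); self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0, 0.5, (hidden, k)); self.b3 = np.zeros(k)

    def __call__(self, m):
        h = np.tanh(np.atleast_2d(m).T @ self.W1 + self.b1)
        h = np.tanh(h @ self.W2 + self.b2)
        return sigmoid(h @ self.W3 + self.b3)   # (len(m), k) loading matrix

def measurement(net, m, beta):
    """Noise-free measurement: y_t(m) = g_theta(m)^T beta_t for a maturity grid."""
    return net(m) @ beta
```

Because the network takes the maturity itself as input, the implied loading functions are smooth in m and can be evaluated at maturities never observed in the data, which is exactly the property the unrestricted M × k measurement matrix lacks.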
For a fixed set of maturities m = [m_1, ..., m_M]^T, the measurement matrix is obtained by stacking the network outputs, H_θ = [g_θ(m_1), ..., g_θ(m_M)]^T. Since H_θ does not depend on the latent factors, the linear form of the model is preserved, and therefore the Kalman filter (Kalman 1960) can be used to compute the conditional distribution of β_t given the first t − 1 observations. This property allows the direct computation of the posterior density function of the model, which is given by

p(Θ | y_1, ..., y_T) ∝ p(Θ) ∏_{t=1}^{T} N(y_t; H_θ β_{t|t−1}, H_θ P_{t|t−1} H_θ^T + Q),

where N is the Gaussian probability density function. The terms β_{t|t−1} = E(β_t | y_1, ..., y_{t−1}, Θ) and P_{t|t−1} = Var(β_t | y_1, ..., y_{t−1}, Θ) are obtained directly from the Kalman recurrence equations. This procedure is also known as the prediction error decomposition.
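The prediction error decomposition can be sketched with a plain sequential Kalman filter over dense matrices (an illustrative implementation only; the paper uses a parallel-scan variant for speed, and the function name is hypothetical):

```python
import numpy as np

def kalman_loglik(Y, H, F, Q, P, beta0, V0):
    """Gaussian log-likelihood of Y via the prediction error decomposition.
    Y: (T, M) observations, H: (M, k) loadings, F: (k, k) transition,
    Q: (M, M) measurement noise cov, P: (k, k) state noise cov."""
    beta, V = beta0, V0
    ll = 0.0
    for y in Y:
        beta_pred = F @ beta                        # beta_{t|t-1}
        V_pred = F @ V @ F.T + P                    # P_{t|t-1}
        S = H @ V_pred @ H.T + Q                    # innovation covariance
        e = y - H @ beta_pred                       # one-step prediction error
        ll += -0.5 * (len(y) * np.log(2 * np.pi)
                      + np.linalg.slogdet(S)[1]
                      + e @ np.linalg.solve(S, e))
        K = V_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
        beta = beta_pred + K @ e                    # beta_{t|t}
        V = (np.eye(len(beta)) - K @ H) @ V_pred    # P_{t|t}
    return ll
```

Adding the log prior density of Θ to this log-likelihood gives the (unnormalized) log-posterior whose gradient is taken by automatic differentiation in the next paragraph.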
Point estimates for the model parameters can be obtained by maximizing the posterior density function with gradient-based optimization methods. The gradient ∇ Θ p(Θ|y) can be efficiently computed for any given differentiable neural network architecture using automatic differentiation algorithms (Rall 1986;Wengert 1964), which can outperform numerical differentiation algorithms, especially in high-dimensional parameter spaces. The cost of computing the Kalman filter in each evaluation of the posterior density function can also be reduced using the parallel algorithm proposed by Särkkä and García-Fernández (2019), which achieves logarithmic span-complexity when multiple processing units are available.
Despite the similarities to Gaussian linear state-space models, performing fully Bayesian inference that relies on sampling methods for the model parameters can be challenging, since neural network models are usually overparameterized and suffer from low parameter identifiability. In recent years, there has been an increasing amount of literature on approximate inference methods for quantifying uncertainty in Bayesian neural networks (see Filos et al. 2019 for a benchmark and discussion of current approaches), but it is not in the scope of this work to provide accurate probabilistic forecasts. Nevertheless, we note that this limitation may pose a challenge for practitioners interested in yield curve forecasting from a risk management perspective.

Specification of Prior Distribution
Neural networks typically require a large number of parameters to model complex functions and are prone to overfitting in the absence of large datasets or regularization mechanisms (Barron 1991). As Neal (1992) and MacKay (1992) show, the Bayesian framework can be successfully applied to neural networks, providing a variety of theoretically solid regularization techniques based on the choice of the prior distribution defined over the neural network weights. One common choice is the zero-centered uncorrelated normal prior:

θ_i ∼ N(0, σ²_NN),   i = 1, ..., dim(θ).

This prior distribution induces an L2 weight penalty (also known as the ridge penalty; Hoerl and Kennard 1970) on the posterior density function. The amount of shrinkage is controlled by the hyperparameter σ²_NN, where smaller values of σ²_NN induce larger penalties.
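The correspondence between this prior and the ridge penalty can be seen directly from its negative log-density, which, up to an additive constant, is the familiar L2 term (a one-line sketch; the function name is illustrative):

```python
import numpy as np

def neg_log_gaussian_prior(theta, sigma2):
    """Negative log-density of an i.i.d. N(0, sigma2) prior over the weights,
    up to an additive constant: the L2 / ridge penalty ||theta||^2 / (2 sigma2)."""
    return np.sum(theta ** 2) / (2 * sigma2)
```

Maximizing the posterior is thus equivalent to maximizing the likelihood minus this penalty, with smaller σ²_NN giving a stronger pull of the weights toward zero.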
Another well-studied use of Bayesian shrinkage priors is in vector autoregressive (VAR) models, where the number of parameters can also be excessive in high dimensions. One commonly chosen prior distribution is the "Minnesota prior" (Litterman 1986), which assigns a normal distribution centered around a random walk to the VAR coefficients. This hypothesis is especially useful for yield curve forecasting, since the random walk model is usually a tough benchmark to beat, especially over short horizons (see Diebold and Li 2006;Ang and Piazzesi 2003, for instance).
Instead of the empirical Bayes approach described by Litterman (1986), we use a common variation of the Minnesota prior based on normal–inverse-gamma conjugacy:

f_ij ∼ N(u_ij, v_ij),   P_i ∼ Inverse-Gamma(a, b),   i, j = 1, ..., k,

where the state equation noise covariance matrix P is a diagonal matrix with components P_1, ..., P_k, f_ij is a term of the transition matrix F, and the prior moments u_ij and v_ij are defined as

u_ij = 1 if i = j and 0 otherwise,   v_ij = λ² if i = j and (γλ)² otherwise.

The hyperparameter λ controls the strength of the support for the random walk hypothesis, while γ controls the importance of the relationships between different factors relative to λ.
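A sketch of plausible Minnesota prior moments is given below, assuming the standard parameterization (prior mean 1 on each factor's own lag and 0 elsewhere, with off-diagonal variances shrunk by γ); the exact u_ij and v_ij used in the paper may differ, and the function name is hypothetical:

```python
import numpy as np

def minnesota_moments(k, lam, gamma):
    """Prior mean U and variance V for the k x k transition matrix F:
    mean 1 on the diagonal (random walk), 0 off-diagonal; off-diagonal
    variances shrunk by gamma relative to the diagonal scale lam."""
    U = np.eye(k)
    V = np.full((k, k), (gamma * lam) ** 2)  # cross-factor coefficients
    np.fill_diagonal(V, lam ** 2)            # own-lag coefficients
    return U, V
```

Under this parameterization, a small γ concentrates the off-diagonal coefficients of F near zero, which matches the interpretation of γ = 0.1 as the strongest cross-factor penalty in the hyperparameter study later in the paper.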
In the next section, we perform an empirical comparison of the out-of-sample forecasting performance between the proposed model and approaches based on the dynamic Nelson and Siegel model. At the end of Section 4, the impact of the prior distribution hyperparameters is also briefly analyzed.

Experimental Setup
In order to empirically assess the forecasting performance of the proposed technique, we use the Brazilian yield curve constructed by the B3 Brazilian Stock Exchange using future interbank deposit rates, as commonly used in the Brazilian term structure modeling literature (see Caldeira et al. 2010;Vicente and Tabak 2008;Cajueiro et al. 2009, for instance). We consider 15 years of Brazilian yield curve data from August 2003 to August 2018 using the 11 maturities at 1, 2, 3, 6, 12, 24, 36, 48, 60, 72, and 84 business months. The missing yield curve vertices are interpolated using the "flat-forward" method (Maltz 2002).
For this experiment, we benchmark out-of-sample forecasts obtained with the proposed neural network augmented state-space model (NNSS) against the following classical modeling approaches: the random walk model (RW), the dynamic Nelson and Siegel model (DNS), and the dynamic Nelson, Siegel, and Svensson model (DSV). The DNS and DSV models are both evaluated using the two-step and one-step parameter estimation approaches. In order to assess whether the inclusion of the neural network component is beneficial to the proposed model, we also benchmark a complete Gaussian linear state-space model (GLSS), where the entire measurement matrix H is estimated directly with the remaining parameters. To verify whether four-factor models are sufficient, we test both the NNSS and GLSS models using four and five latent factors.
In the two-step estimation approach for the DNS and DSV models, following the method described by Nelson and Siegel (1987), a grid search procedure is used to choose the exponential decay parameters that minimize the total sum of squared errors between the observed yield curves and the curves reconstructed from the decomposition factors, ∑_{t=1}^{T} ||y_t − H_λ β̂_t||²_2. The obtained values are also used in the respective one-step models.
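Such a grid search can be sketched as follows for the three-factor Nelson and Siegel case (the function name and grid are illustrative; each candidate λ is scored by the reconstruction error after OLS factor extraction):

```python
import numpy as np

def grid_search_lambda(Y, maturities, lam_grid):
    """Pick the decay parameter lambda minimizing the total reconstruction
    SSE sum_t ||y_t - H_lam beta_t||^2, with beta_t obtained by OLS."""
    best_lam, best_sse = None, np.inf
    for lam in lam_grid:
        m = np.asarray(maturities, dtype=float)
        slope = (1 - np.exp(-lam * m)) / (lam * m)
        H = np.column_stack([np.ones_like(m), slope, slope - np.exp(-lam * m)])
        B = np.linalg.lstsq(H, Y.T, rcond=None)[0]   # factors for all curves
        sse = np.sum((Y.T - H @ B) ** 2)
        if sse < best_sse:
            best_lam, best_sse = lam, sse
    return best_lam
```

For the DSV case, the same idea applies with a two-dimensional grid over (λ_1, λ_2) and the four-column Svensson loading matrix.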
To increase numerical stability and accelerate convergence, we initialize the one-step DNS and DSV with the estimates obtained in the two-step approach. For the same purpose, we also initialize the measurement matrix H of the GLSS model with the factor loading matrix used in the DSV model. Similarly, the neural network parameter vector θ of the NNSS is also initialized with a solution that closely reproduces (in the squared error sense) the DSV factor loadings. This is carried out by pre-training the neural network g θ with that objective.
In all NNSS models, we use a neural network with two hidden layers, each with 300 hidden units and hyperbolic tangent activation functions, which are smooth and have good convergence properties (LeCun et al. 2012). To help interpret the model factors and obtain factor loadings that are similar to those of the Nelson and Siegel model, we use a sigmoid activation function in the last layer of the neural network, which produces factor loading functions bounded in the [0, 1] interval.
The models based on the one-step approach are estimated using gradient-based optimization methods with the Pytorch deep learning library (Paszke et al. 2019) and the parallel-scan Kalman filter algorithm (Särkkä and García-Fernández 2019) implemented in the Pyro probabilistic programming library (Bingham et al. 2019) in the Python programming language.
The prior distribution hyperparameters of the NNSS model are chosen based on the results obtained using the one-step DSV model and with the aid of simulation procedures. In this experiment, we use weakly informative prior distributions in order to assess the model behavior in the absence of strong external influences. A more comprehensive analysis of the impact of the prior distribution on model performance is presented at the end of the results section.
For the hyperparameters of the inverse gamma distribution, we select (a, b) = (0.1, 0.001) such that the resulting prior is weakly informative, and the mode of the distribution is close to the estimated state transition variances of the DSV model. The remaining hyperparameters of the Minnesota prior (λ, γ) = (0.5, 0.9) are chosen in a qualitative fashion by simulating the resulting prior transition matrices and determining whether the samples generally match our expected results while still preserving variability. Analogously, the same process is used to choose the prior variance of neural network weights σ 2 NN = 0.05 (Figure 1).

Forecast Evaluation
Given the computational complexity of the NNSS model, we evaluate all of the described modeling approaches using only the first 12 years of data to estimate model parameters once and the remaining 3 years to evaluate the out-of-sample forecasts for 1 week, 1 month, 3 months, and 6 months ahead. For the models based on the two-step approach, the h-step ahead forecasts at time t are calculated by evolving the extracted factors available at t to step t + h, which is carried out using the estimated VAR model. Then, the resulting prediction is converted to yield curve format using the decomposition equation.
For the models based on the one-step estimation approach, the procedure is similar, but it requires the extra step of updating the Kalman filter at the end of each iteration. The complete procedure is described below for each timestep t:
1. Obtain the forecast β̂_{t+h|t} = F̂^h β̂_{t|t} of the t + h yield curve factors, using the filtered factors β̂_{t|t} = E(β_t | y_1, ..., y_t, Θ̂) and the estimated transition matrix F̂;
2. Convert the forecasted factors using the estimated measurement matrix Ĥ to obtain the prediction of the complete yield curve from t to t + h: ŷ_{t+h|t} = Ĥ β̂_{t+h|t};
3. Update the Kalman filter recurrence equations with the next observation y_{t+1} to obtain β̂_{t+1|t+1}.
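The factor-evolution and curve-reconstruction steps of this procedure amount to a single matrix expression (a sketch; the function name is illustrative, and the Kalman update step is omitted here):

```python
import numpy as np

def forecast_curve(beta_filt, F, H, h):
    """h-step ahead yield curve forecast from the filtered factors:
    y_hat_{t+h|t} = H @ F^h @ beta_{t|t}."""
    return H @ np.linalg.matrix_power(F, h) @ beta_filt
```

The two-step models use the same expression, with the VAR coefficients in place of the state-space estimates and the factors extracted by OLS in place of the filtered factors.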
Then, the forecasts are evaluated using the root-mean-square error (RMSE) metric, which is calculated for each considered maturity m and horizon h as

RMSE(m, h) = √[ (1/(T − h − t_0 + 1)) ∑_{t=t_0}^{T−h} (y_{t+h}(m) − ŷ_{t+h|t}(m))² ],

where t_0 denotes the index of the first out-of-sample observation, ŷ_{t+h|t}(m) is the forecast from t to t + h of the yield for maturity m, and T is the total number of observations in the dataset.
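Given arrays of realized curves and their aligned h-step forecasts, the per-maturity RMSE over the out-of-sample window can be computed as follows (a sketch; the function name is illustrative, and the forecast array is assumed to be already aligned so that row t holds the forecast of row t of the observations):

```python
import numpy as np

def rmse_by_maturity(Y, Y_hat, t0):
    """RMSE over the out-of-sample window [t0, T) for each maturity column.
    Y: (T, M) realized yields; Y_hat: (T, M) aligned h-step forecasts."""
    err = Y[t0:] - Y_hat[t0:]
    return np.sqrt(np.mean(err ** 2, axis=0))
```

Averaging the resulting vector over the M maturities gives the per-model summary scores reported in the results tables.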

Results
Tables 1-4 report the RMSE of the out-of-sample forecasts of the nine considered models for the four forecast horizons. For each model, we report error metrics for each of the 11 considered maturities as well as the average and standard deviation.
In the 1-week ahead forecasts (Table 1), all models except for the 5-GLSS model and the dynamic Nelson and Siegel model achieved better overall scores than the random walk (RW) method, which is typically a tough benchmark to beat over short horizons. The five-factor NNSS model achieved a slightly lower average RMSE score, which is followed closely by the four-factor NNSS model.
In the 1-month ahead forecast (Table 2), the NNSS models still achieved the lowest average RMSE, but we note that the DSV models performed better than the rest of the contenders. Overall, among all of the models, the DSV models had the best performance on short-and medium-term maturities, while the NNSS models had the best performance on longer maturities.
A similar pattern can be identified in the 3-month ahead forecast results, although the performance gap between the NNSS models and the other candidates becomes more noticeable. Surprisingly, for the DSV model, the two-step approach has perceptibly lower RMSE scores than the one-step version. We also note that the performance of the GLSS models decreased significantly, remaining only above that of the DNS models and the random walk method.
In the longest considered forecast horizon of 6 months, the relative performance of the NNSS models with respect to the other candidates was greatly improved, especially for the five-factor model. We also observe that the two-step DSV model and the 5-GLSS model achieved good performance.
The estimated yield curve factor loadings of the DNS, DSV, GLSS, and NNSS models are plotted in Figure 2. As expected, the factor loadings of the GLSS models present nonsmooth behavior, especially the four-factor GLSS model. Despite being initialized with the factor loadings of the DSV model, the estimated loadings of the GLSS models seem to have lost the original asymptotic behavior of the Nelson and Siegel decomposition, which can make the process of extrapolating the factor loadings outside of the original range of maturities difficult and unreliable in a real-world setting.
Unlike the GLSS models, the factor loadings of both the four- and five-factor NNSS exhibit a smooth and stable pattern. We also note that, as in the Nelson and Siegel decomposition, the obtained factor loading functions display asymptotically stable behavior, which is a well-known stylized fact about yield curves. We attribute this property of the NNSS model to the choice of the sigmoid activation function in the output layer of the neural network g_θ.

Table 1. Out-of-sample RMSE × 10³ of 1-week ahead forecasts (lowest scores in bold).

Table 3. Out-of-sample RMSE × 10³ for 3-month ahead forecasts (lowest scores in bold).

Figure 2. Estimated factor loadings of the DNS and DSV models, 4- and 5-factor GLSS models, and 4- and 5-factor NNSS models. The factor loadings for the GLSS models are linearly interpolated within the range of the 11 maturities considered in the dataset; factor loadings for maturities outside of this range are not extrapolated.

In some ways, the interpretation of some of the yield curve factors from the 4-NNSS model is similar to that of the Nelson, Siegel, and Svensson model. For instance, factors 1 and 2 can be interpreted as two medium-term factors with peaks around the 1.5-and 3-year maturities, although the first factor loading exhibits a steeper growth and decay rate as well as a slight bump around the shorter maturities. Another notable difference between the 4-NNSS and DSV factors can be seen in the third loading function, which resembles a mirrored version of the exponential decay pattern of the first Nelson and Siegel loading. Therefore, the intercept term β 0 in the 4-NNSS model can be interpreted as a short-term factor.
In the five-factor version of the NNSS model, the interpretation of the factors becomes slightly convoluted, but it is still possible to identify a short-term factor (factor 1) and three different kinds of medium- and long-term factors (factors 2, 3, and 4). Similar to the first loading of the 4-NNSS model, we also observe a bump in the fourth loading of the 5-NNSS model located at the short end of the yield curve, which indicates that this factor also has some influence on shorter maturities.

Figure 3 presents the time-series of the filtered yield curve factors for each considered model. The extracted factors of the 4-GLSS and 5-GLSS models exhibit a great amount of variability, especially in the second and third factors. Among all of the evaluated decompositions, the yield curve factors from the NNSS models are the most stable, which indicates that the learned decompositions are useful for the forecasting task.

Figure 3. Filtered yield curve factors β̂_{t|t} = E(β_t | y_1, ..., y_t) for the one-step DNS, one-step DSV, 4- and 5-factor GLSS, and 4- and 5-factor NNSS models.

Impact Analysis of the Prior Distribution Hyperparameters
In order to empirically assess the impact of the specified prior distribution in the proposed model, we repeat the forecast experiment described in the previous section for the 4-NNSS model with a variable set of hyperparameters. The four-factor version of the NNSS model was used due to the faster convergence of the model parameters and easier factor loading interpretation.
The experimental results reported in Table 5 verify that the Minnesota prior distribution significantly impacted the overall forecasting performance of the model, especially over longer horizons. The forecast accuracy dropped significantly under the strongest penalty tested (γ = 0.1), although a medium penalty value (γ = 0.5) produced better results than the weakly informative prior used in the previous experiment (γ = 0.9). The neural network shrinkage prior affected the overall forecast performance only slightly; in the models estimated with γ ∈ {0.5, 0.9}, a stronger penalty (a lower value of σ²_NN) produced a small but consistent gain in forecast performance. The factor loading plots for each tested model are included in Appendix A.

Table 5. Average out-of-sample forecast RMSE scores for the 4-NNSS model estimated with each combination of the prior hyperparameters γ ∈ {0.1, 0.5, 0.9} and σ²_NN ∈ {5 × 10⁻³, 1 × 10⁻², 5 × 10⁻², 1 × 10⁻¹}, keeping (a, b) = (0.1, 0.001) and λ = 0.5 fixed.

Conclusions
In this paper, we proposed a yield curve forecasting model that learns continuous and smooth decompositions directly from data using a neural network. We showed how the neural network can be integrated into a Gaussian linear state-space model and presented a joint estimation procedure to obtain maximum a posteriori (MAP) estimates for decomposition and autoregressive parameters.
In the empirical evaluation, we obtained better overall out-of-sample forecast performance than traditional approaches on 14 years of Brazilian yield curve data, especially for longer maturities and forecast horizons. By analyzing the generated decompositions, we found that the model was able to reproduce the advantageous properties of traditional yield curve models, such as the smoothness, asymptotic stability, and economic interpretability of the factor loadings, and that the learned decompositions produced yield curve factors that were more stable and easier to forecast than those obtained with traditional approaches.
Finally, we analyzed the role of the prior distribution hyperparameters in the out-of-sample forecasting performance and found that the shrinkage effect on the model parameters can significantly influence the quality of the produced forecasts, especially over longer time horizons. We found that stronger penalties on the neural network weights (σ²_NN = 5 × 10⁻³) and medium-level penalties on the state equation parameters (γ = 0.5) resulted in the best overall prediction performance in the experiments performed.
Given the direct similarities, we note that a wide range of extensions developed for the dynamic Nelson and Siegel model can also be adapted to the technique proposed in this paper, such as the inclusion of macroeconomic variables in the state equation and multicountry modeling (Diebold et al. 2008), which may be interesting topics for future research.
Another possible line of research concerns the optimization method used to obtain point estimates for the model parameters, since pure gradient-based optimization methods in state-space models can be unstable and require the careful selection of step sizes. As commonly used in the dynamic factor modeling literature, the Expectation-Maximization (EM) algorithm can be a robust alternative that usually approaches an optimal neighborhood very quickly, but it requires many iterations to reach full convergence. As noted by Watson and Engle (1983), the EM algorithm can be used to quickly provide good starting points to a gradient-based optimization algorithm, combining the best parts of both approaches. For the method proposed in this paper, the maximization step of the EM algorithm would also require the internal use of iterative methods, given the non-linear nature of the neural network component, but it could effectively increase the robustness of the overall optimization procedure.
The computer code used in this work, written in the Python programming language, is available in the Supplementary Materials.

Conflicts of Interest:
The authors declare no conflict of interest.