Multiple Time Series Forecasting Using Quasi-Randomized Functional Link Neural Networks

We are interested in obtaining forecasts for multiple time series, by taking into account the potential nonlinear relationships between their observations. For this purpose, we use a specific type of regression model on an augmented dataset of lagged time series. Our model is inspired by dynamic regression models (Pankratz 2012), with the response variable’s lags included as predictors, and is known as Random Vector Functional Link (RVFL) neural networks. The RVFL neural networks have been successfully applied in the past, to solving regression and classification problems. The novelty of our approach is to apply an RVFL model to multivariate time series, under two separate regularization constraints on the regression parameters.


Introduction
In this paper, we are interested in obtaining forecasts for multiple time series, by taking into account the potential nonlinear relationships between their observations.This type of problem has been tackled recently by Exterkate et al. (2016), who applied kernel regularized least squares to a set of macroeconomic time series.The Long Short-Term Memory neural networks (LSTM) architectures (introduced by Hochreiter and Schmidhuber (1997)) are another family of models, which are currently widely used for this purpose.As a basis for our model, we will use (quasi-)randomized neural networks known as Random Vector Functional Link neural networks (RVFL networks hereafter).
The forecasting method described in this paper, provides useful inputs for Insurance Quantitative Risk Management models; the interested reader can refer to Bonnin et al. (2015) for an example.
To the best of our knowledge, randomized neural networks were introduced by Schmidt et al. (1992), and the RVFL networks were introduced by Pao et al. (1994).An early approach for multivariate time series forecasting using neural networks is described in Chakraborty et al. (1992).They applied a back propagation algorithm from Rumelhart et al. (1988) to trivariate time series, and found that the combined training of the series gave better forecasts than a separate training of each individual series.The novelty of the approach described in this paper is to derive an RVFL model for multiple time series, under two separate regularization constraints on the parameters, as it will be described in detail in Section 2.3.
RVFL networks are multilayer feedforward neural networks, in which there is a direct link between the predictors and the output variable, aiming at capturing the linear relationships.In addition to the direct link, there are new features: the hidden nodes (the dataset is augmented), that help to capture the nonlinear relationships between the time series.These new features are obtained by random simulation over a given interval.More details on the direct link and the hidden nodes will be provided in the next section.
The RVFL networks have been successfully applied to solving different types of classification and regression problems, see for example Dehuri and Cho (2010).More specifically, they have been applied to univariate time series forecasting by Ren et al. (2016).A comprehensive survey can be found in Zhang and Suganthan (2016), where a large number of model specifications are tested on classification problems, including changing the range of hidden layer's randomization.
Here, we will use RVFL networks with one hidden layer.Furthermore, instead of relying on fully randomized nodes, we will use sequences of deterministic quasi-random numbers.Indeed, with fully randomized nodes, the model fits obtained are dependent on the choice of a simulation seed.Typically, a different fitting solution would be obtained for each seed.
In our various numerical examples from Section 3, we will apply the RVFL networks to forecasting trivariate time series, notably (but not only) in a Dynamic Nelson Siegel (DNS) framework (see Nelson and Siegel 1987;Diebold and Li 2006).We will obtain point forecasts and predictive distributions for the series, and see that in this RVFL framework, one (or more) variable(s) can be stressed, and influence the others.More precisely, about this last point, it means that it is possible, as in dynamic regression models (Pankratz 2012) to assign a specific future value to one regressor, and obtain forecasts of the remaining variables.Another advantage of the model described here is its ability to integrate multiple other exogenous variables, without overfitting in-sample data.

Description of the Model
The general procedure for obtaining the model's optimal parameters and predictions is summarized in Figure 1.This procedure is described in detail in the next sections, especially in Sections 2.3 and 2.4.

On a Single Layer RVFL Network
We rely on single layer feed forward neural networks (SLFN).Considering that an output variable y ∈ R n is to be explained by a set of observed predictors Z (j) ∈ R n , j ∈ {1, . . . ,p}, the RVFL networks we will use to explain y can be described for i ∈ {1, . . . ,n} as: g is called activation function, L is the number of nodes in the hidden layer, W (j,l) are elements of the hidden layer, and the parameters β j and γ l are to be learned from the observed data Z (j) , j ∈ {1, . . . ,p}.The i 's are the residual differences between the output variable values and the RVFL model.This type of model can be seen as one explaining y i , by finding a compromise between linear and potentially non-linear effects of the original predictors Z (j) and transformed predictors . . ,L} on the response.Common choices for function g in neural networks regression are the sigmoïd activation function the hyperbolic tangent function, or the Rectified Linear Units, known as ReLU The main differences between the RVFL framework and a classical SLFN framework are: • The inclusion of a linear dependence between the output variable and the predictors: the direct link, i .

•
The elements W (j,l) of the hidden layer are typically not trained, but randomly and uniformly chosen on a given interval.Different ranges for these elements of the hidden layer are tested in (Zhang and Suganthan 2016).
Solving for the optimal parameters β j 's and γ l 's can be done by applying directy a least squares regression of y on the set of observed and transformed predictors.But since these input predictors are likely to be highly correlated-especially in our setting, with time series data-we do not search each of these parameters on the entire line, but in restricted regions where we have: That is, applying some kind of Tikhonov regularization or ridge regression model (Hoerl and Kennard 1970) of y on the set of observed and transformed predictors.Having two constraints instead of one allows for more flexibility in the covariance structure between the predictors and the output, with β j 's and γ l 's moving in separate balls.For these constraints to be applicable, the input variables will need to be standardized, so as to be expressed on the same scales, and the response variable will be centered.Imposing these restrictions to the model's parameters increases their interpretability-by reducing their variance-at the expense of a slight increase in in-sample bias.It also prevents the model from overfitting the data as in ridge regression (Hoerl and Kennard 1970).One of the advantages of RVFL networks is that they are relatively fast to train, due to the availability of closed-form formulas for the model's parameters, as it will be presented in the next section.
On the other hand, RVFL networks incorporate some randomness in the hidden layer, which makes each model relatively dependent on the choice of a simulation seed.Each seed would indeed produce a different set of parameters β j 's and γ l 's for the model.For that reason, we will also use sequences of deterministic quasi-random numbers in the hidden layer.The elements W (j,l) of the hidden layer are taken from a quasi-random (deterministic) sobol sequence on [0, 1], which is shifted in such a way that they belong to [−1, 1].
Sobol sequences are part of quasi-random numbers, which are also called low discrepancy sequences.As described intuitively in Boyle and Tan (1997), the discrepancy of a sequence of N points in a subcube V ∈ [0, 1) d is defined as: where v(V) is the volume of V.It describes how well the points are dispersed within V. The idea is to have points which are more or less equidispersed in V. Joe and Kuo (2008) describe the generation of the i th term, j th component (x i,j ) of a Sobol sequence.The generation starts with obtaining the binary representation of i.That is, obtaining i as: For example, 5 = 1 × 2 2 + 0 × 2 1 + 1 × 2 0 can be expressed as (101) 2 in binary representation.Then, by using the sequence of bits describing i in base 2, we can obtain x i,j as: where ⊕ is a bitwise exclusive-or operation, and the numbers v i,j are called the direction numbers, defined for k ≥ 1 as: A few details on Equation (2): • The bitwise exclusive-or operation ⊕ applied to two integers p and q ∈ {0, 1} returns 1 if and only if one of the two (but not both) inputs is equal to 1. Otherwise, p ⊕ q is equal to 0.

•
The second term of the equation relies on primitive polynomials of degree s j , with coefficients a i,j taken in {0, 1}: • The terms m k,j are obtained recursively, with the initial values m 1,j , m 2,j , . . ., m k−s j ,j chosen freely, under the condition that m k,j , 1 ≤ k ≤ s j is odd and less than 2 k .
A more complete treatment of low discrepancy sequences can be found in Niederreiter (1992).In addition, an example with s j = 3, a 1,j = 0, a 2,j = 1 is given in Joe and Kuo (2008).

Applying RVFL Networks to Multivariate Time Series Forecasting
We consider p ∈ N * time series (X (j) t ) t≥0 , j = 1, . . ., p, observed at n ∈ N * discrete dates.We are interested in obtaining simultaneous forecasts of the p time series at time n + h, h ∈ N * , by allowing each of the p variables to be influenced by the others (in the spirit of VAR models, see Lütkepohl 2005).
For this purpose, we use k < n lags of each of the observed p time series.The output variables to be explained are: for j ∈ {1, . . . ,p}.Where X (j) n is the most contemporaneous observed value of the j-th time series, and X (j) k+1 was observed k dates earlier in time for (X (j) t ) t≥0 .These output variables are stored in a matrix: and the predictors are stored in a matrix: where X consists in p blocks of k lags, for each one of the observed p time series.For example, the j th 0 block of X, for j 0 ∈ {1, . . . ,p} contains in columns: with i ∈ {1, . . . ,k}.It is also possible to add other regressors, such as dummy variables, indicators of special events, but for clarity, we consider only the inclusion of lags.
As described in the previous section, an additional layer of transformed predictors is added to X, in order to capture the potentially non-linear interactions between the predictors and the output variable.Adding the transformed predictors to the original ones leads to a new matrix of predictors with dimensions (n − k) × (k × p + L), where L is the number of nodes in the hidden layer.We are then looking for simultaneous predictions X(j) n+h|n,...,1 =: X(j) n+h for h ∈ N * , and j ∈ {1, . . . ,p}.This is a multi-task learning problem (see Caruana 1998), in which the output variables will all share the same set of predictors.
For example, we have p = 2 time series (X (1) t 5 ) and (X (2) t 5 ) observed at n = 5 dates t 1 < . . .< t 5 , with k = 2 lags, and L = 3 nodes in the hidden layer.In this case, the response variables are stored in: The predictors are stored in: And the coefficients in the hidden layer are:

Solving for β's and γ's
We let y be the j th 0 column (out of p) of the response matrix Y, and Φ(X) be the matrix of transformed predictors obtained from X by the hidden layer described at the beginning of Section 2.1.We also denote the set of regression parameters associated with this j th 0 time series, as: for m ∈ {1, . . . ,k}; l ∈ {1, . . . ,L}. Solving for the regression parameters for the j th 0 time series, under the constraints for u, v ∈ R * , leads to minimizing a penalized residual sum of squares.Hence, for vectors β ∈ R (k×p)  and γ ∈ R L containing the regression parameters, we obtain the Lagrangian: where λ 1 and λ 2 are Lagrange multipliers.Taking the first derivatives of L relative to β and γ leads to: where I k×p is the identity matrix with dimensions (k × p) × (k × p) and equivalently where I L is the identity matrix with dimensions L × L. And setting these first derivatives to 0 leads to: That is: Now, if we denote: and S = D − CB + C T .Then, using the algorithm described in Cormen (2009) for blockwise matrix inversion, we obtain: where S + and B + are the Moore-Penrose pseudo-inverse (Penrose 1955) of matrixes S and B. Hence for each column y of Y, we have the solutions: And the whole set of parameters, for all the p observed time series is given by: The objective function to be minimized (the least squares) is convex, and so is the set of feasible solutions.The solutions β and γ found here are hence global minima.

h-Steps Ahead Forecasts and Use of Dynamic Regression
Having obtained the optimal set of parameters β and γ as described in the previous section, a new set of predictors is constructed by using the former output variables contained in response matrix Y's columns.The first k elements of each one of the p columns of Y, which are the most contemporaneous values of the p series, constitute the new predictors.
Hence, if we denote the new predictors as: n−k+1 , . . ., . . ., X The 1-step ahead forecasts are obtained as: The h-step ahead forecasts are obtained in a similar fashion; with the new forecasts X(1) n+1 , . . ., X(p) n+1 being added to the set of most contemporaneous values of the p series, and used as part of the new predictors.
In order to obtain confidence intervals around the point forecasts, we fit an ARIMA model to the in-sample residuals i of each one of the p time series, as in dynamic regression models (see Pankratz (2012)).An illustration can be found in the next section.Other models for the autocorrelated residuals could be envisaged, though.

A Dynamic Nelson-Siegel Example
The following examples are not exhaustive benchmarks, but aim at illustrating the forecasting capabilities of the model described in this paper 1 .
All the results on RVFL use the ReLU activation function.
We use calibrated discount rates data from Deutsche Bundesbank website (http://www.bundesbank.de/Navigation/EN/Statistics/Time_series_databases/time_series_databases.html),observed on a monthly basis, from the beginning of 2002 to the end 2015.There are 167 curves, observed at 50 maturities in the dataset.We obtain curves' forecasts in a Dynamic Nelson Siegel (Nelson and Siegel 1987) framework (DNS), in the spirit of Diebold and Li (2006) and other similar models 2 . 1 Dutang and Savicky (2015) and Wickham (2016) were used extensively for these examples.
2 Diebold and Rudebusch (2013): "there are by now literally hundreds of DNS applications involving model fitting and forecasting".
In Figure 2, we present the data that we use, and Table 1 contains a summary of these data; the minimum, maximum, median, first and third quartiles of the discount rates observed at given maturities.There are alternate cycles of increases and decreases of the discount rates, with a generally decreasing trend.Some of the discount rates, at the most recent dates, and lower maturities, are negative.In the DNS framework, the spot interest rates observed at time t, for time to maturity τ are modeled as: The factor loadings 1, 1−e −T/λ e −T/λ and 1−e −T/λ e −T/λ − e −T/λ are used to represent the level, slope, and curvature of the Yield Curve.We obtain estimations of α i,t , i = 1, . . ., 3 for each cross-section of yields by fixing λ, and doing a least squares regression on the factor loadings.The three time series α i,t , i = 1, . . ., 3 associated to the loadings for each cross-section of yields, are those that we wish to forecast simultaneously, by using an RVFL model.This type of model (DNS) cannot be used for no-arbitrage pricing as is, but it could be useful for example, for stressing the yield curve factors under the historical probability.It can however be made arbitrage-free, if necessary (see Diebold and Rudebusch 2013).We will benchmark the RVFL model applied to the three time series α i,t , i = 1, . . ., 3, against ARIMA (AutoRegressive Integrated Moving Average) and VAR (Vector AutoRegressive) models.Diebold and Li (2006) applied an autoregressive AR(1) model separately to each one of the parameters, α i,t , i = 1, . . ., 3.
We will apply to these parameters' series: an ARIMA model (Hyndman and Khandakar 2008), and a Vector Autoregressive model (VAR, see Pfaff et al. 2008;Lütkepohl 2005); with the parameter λ of the DNS factor loadings, used as an hyperparameter for the time series cross-validation.In the RVFL and the VAR model, the number of lags is also used as an hyperparameter for the cross-validation.For the RVFL, the most recent values of α i,t , i = 1, . . ., 3 are stored in matrix Y, as described in Section 2.3, ordered by date of arrival, whereas matrix X contains the lags of the three series.
A rolling forecasting methodology (see Bergmeir et al. 2015) is implemented in order to obtain these benchmarks.A fixed 12 months-length window is established for training the model, and the following 12 months are used for testing; the origin of the training set is then advanced of 1 month, and the training/testing procedure is repeated.The measure of forecasting performance is the Root Mean Squared Error (RMSE).Figure 3 presents boxplots for the distribution of out-of-sample errors obtained in the cross-validation procedure, and Figure 4 presents the 12 months-ahead out-of-sample errors over time.ARIMA (separate (Hyndman and Khandakar 2008) ARIMA models applied to each series α i,t , i = 1, . . ., 3) gives good results, as already suggested by Diebold and Li (2006).They are nearly comparable to results from RVFL, but a bit more volatile, with an outlier point observed on the log(RMSE) box plot.
The unrestricted VAR model results include more volatility than the two other methods on this specific example, especially in the period of financial crisis between 2007 to 2009, as seen on Figure 4. Table 2 is to be read in conjuction with the log(RMSE) box plot presented in Figure 3.It summarises the results obtained by the different methods on the out-of-sample RMSE.Table 3 contains 95% confidence intervals around the mean of the differences between the three methods.Another advantage of RVFL over ARIMA or AR(1) in this context is that it would be possible to add other variables to the RVFL regression, such as inflation, or dummy variables for external events, and combine their effects.It is also possible to stress one variable, and see the effects on the other variables, as presented in the Appendix A.1: the parameter α 1,t is increased (from 0.75 to 1.25) and decreased (from 0.75 to 0.25), and the other parameters α 2,t and α 3,t forecasts move slightly, consecutively to these stresses.The corresponding median forecast curves for these stresses, and some additional ones, are presented in Figure 5.

Forecasting 1 Year, 10 Years and 20 Years Spot Rates
For this second example, we forecast the 1-year, 10-year and 20-year spot rates time series from the previous dataset, on a 12-month horizon.As described in the previous section, we use a rolling forecasting methodology, with a training window of 12 months' length.
Figure 6 presents the three time series of data, and a summary of the data can be found in Tables 4 and 5.The three time series globally exhibit a decreasing trend, and are highly positively correlated.The spot rates for short-term maturities can also be negative, as it has been observed recently in 2016.The spreads between the spot rates time series are extremely narrow during the 2007-2009 crisis.The tables below (Tables 6 and 7) contain the results of a comparison between the RVFL model and an unrestricted VAR model (with one lag, best parameter found) on the forecasting problem.The best RVFL model, with the lowest out-of-sample RMSE, uses one lag, four hidden nodes, and λ 1 = 5.80, λ 2 = 19.66.In this third example, as in Section 3.1, we apply the DNS framework to the forecasting of spot rates.However, in this example we have a longer training set (36 months), and a longer horizon for the test set (36 months as well).We use interest rate swaps data from the Federal Reserve Bank of St Louis website 3 , observed on a monthly basis from July 2000 to September 2016, with maturities equal to 1, 2, 3, 4, 5, 7, 10, 30, and a tenor equal to three months.
In Figure 7, we present three of the eight time series of swap rates, observed for time to maturities equal to 3, 10 and 30.The swap rates for different maturities generally exhibit a decreasing trend, and are nearly equal to 0 by the end of 2016, for the shortest maturities.We also observe that the spreads between swap rates with different maturities start to narrow in 2006 until the end of 2007, and the swap rates for short term maturities are relatively high during the same period.This is the period corresponding to the Liquidity and Credit Crunch 2007-2008.Table 8 presents the descriptive statistics for these three time series.All the swap rates (for all the maturities available) were then transformed into zero rates, by using a single curve calibration methodology (that is, ignoring the counterparty credit risk) with linear interpolation between the swaps' maturities.Then, the Nelson & Siegel model was used for fitting and forecasting the curves in a DNS framework, with both auto.arima and the RVFL model presented in this paper, applied to the three factors.In the fashion of Section 3.1.However, now we obtain 36-month's ahead forecasts, from a rolling training windows with a fixed 36 month's length.The average out-of-sample RMSE are then calculated for each method.
The best hyperparameters-associated with the lowest out-of-sample average RMSE-for each model are obtained through a search on a grid of values.We have: With these parameters, the results detailed in Table 9 are obtained, for the out-of-sample RMSE.A 95% confidence interval around the difference of out-of-sample RMSE between ARIMA (applied to each one of the three factors) and RVFL is presented in Table 10.   Figure 9, presents the convergence of the out-of-sample log(RMSE) for the DNS + RVFL model from this section, as a function of log(λ 1 ) and log(λ 2 ).λ 1 and λ 2 both range from 10 −2 to 10 4 , with 10 equally-spaced points each (hence, a grid of 100 points (log(λ 1 ), log(λ 2 ), log(RMSE)).
The number of nodes in the hidden layer is equal to 45, and the value of λ, parameter from the Nelson and Siegel (1987) model presented in Section 3.1, is fixed and equal to 24.The one hundred points (log(λ 1 ), log(λ 2 ), log(RMSE)) that we use for Figure 9 can be found in Appendix A.3.
There is a rectangular region at the top, in the middle of the figure, where the log(RMSE) is the lowest.In this region, the lowest value of the out-of-sample log(RMSE) is observed for λ 1 = 4.6416 and λ 2 = 464.1589and the out-of-sample RMSE is equal to 0.01206 (75th point in Appendix A.3).

Figure 1 .
Figure 1.General procedure for multiple time series forecasting with RVFL.

Figure 2 .
Figure 2. Observed discount rates from Deutsche Bundesbank website, from 2002 to the end 2015.

Figure 8
Figure 8 presents the evolution of the out-of-sample log(RMSE) over the training/testing windows.The grey rectangle indicating the Liquidity and Credit crunch is larger here, because in this example, a training set starting in 2004 has its test set starting 36 months later, in 2007.Again, we observe that the results from the RVFL model exhibit a low out-of-sample error, along with a low volatility.

Table 1 .
Summary of observed discount rates from Deutsche Bundesbank website, from 2002 to the end 2015.

Table 3 .
95% confidence interval around the difference of out-of-sample RMSE.

Table 4 .
Summary of the data for 1-year, 10-years and 20-years spot rates time series (in %).

Table 5 .
Summary of data for 1-year, 10-year and 20-year spot rates time series.

Table 6 .
Comparison of 12 months ahead out-of-sample RMSE, for the RVFL, and VAR.

Table 7 .
95% confidence interval around the difference of out-of-sample RMSE.

Table 8 .
Descriptive statistics of St Louis Federal Reserve data for 1-year, 10-years and 30-years swap rates (in %).

Table 9 .
Descriptive statistics for out-of-sample RMSE, with rolling training window = 36 months, and testing window = 36 months.

Table 10 .
95% confidence interval around the difference of out-of-sample RMSE.