A Novel Neuro-Fuzzy Model for Multivariate Time-Series Prediction †

Time-series forecasting can be a complicated problem when the underlying process shows a high degree of complex nonlinear behavior. In some domains, such as financial data, processing related time-series jointly can have significant benefits. This paper proposes a novel multivariate hybrid neuro-fuzzy model for forecasting tasks, which is based on and generalizes the neuro-fuzzy model with consequent-layer multivariable Gaussian units and its learning algorithm. The model is distinguished by a separate consequent block for each output, which is tuned with respect to its own output error only but benefits from extracting additional information by processing the whole input vector, including lag values of the other variables. Numerical experiments show better accuracy and computational performance than competing models and separate neuro-fuzzy models for each output, and thus an ability to implicitly handle complex cross-correlation dependencies between variables.


Introduction
Time-series represent data points ordered in the time domain, and they arise in many fields including economics, engineering, sociology, and medicine. Mathematical tools for time-series forecasting can crucially improve the decision-making process. Statistical models have dominated the field of quantitative time-series analysis for decades. These models are predominantly parametric and therefore require time-series to be weakly stationary, which is achieved through various differencing techniques, whose effectiveness is disputed [1,2] due to the highly nonstationary and nonlinear nature of real-life time-series.
The computational intelligence (CI) approach proposes nature-inspired models and methods that have been competing against classical statistical approaches in many applied domains. Among the principal CI techniques, Artificial Neural Networks (ANN) are the best known and most widely used. Many neural models have been applied to financial time-series forecasting [3][4][5]. Crone et al. [4] proved the ability of such networks to represent complex data patterns, including high-frequency, seasonal, and short time-series.
Fuzzy logic and fuzzy sets theory have been extended to a variety of mathematical objects by introducing a gradual degree of membership and have influenced uncertainty analysis [6,7]. Rule-based fuzzy models have been applied to various domains including stock market forecasting [8,9]. However, they have some essential drawbacks, primarily caused by their dependence on expert knowledge or complex rule extraction methods.
Neuro-fuzzy systems combine the strengths of ANNs and fuzzy inference systems and have been successfully applied to a variety of problems. Their advantages are especially visible in rapidly changing domains with a high degree of uncertainty, e.g., routing guidance and planning [10,11], electric network load forecasting [12,13], complex logistics [14], crude oil blending [15], human resources portfolio management [16], and many others. Kar et al. [17] provided a brief review of neuro-fuzzy model applications including forecasting. The Adaptive Network-based Fuzzy Inference System (ANFIS), proposed by Jang [18], was the first successfully applied neuro-fuzzy model, and the majority of later models are based on it. Recently, neo-fuzzy systems [19][20][21], which are based on zero-order additive models, have been gaining popularity in various domains.
In the field of stock market prediction, examples of successfully applied neuro-fuzzy models include the ANFIS model with a modified Levenberg-Marquardt learning algorithm for Dhaka Stock Exchange day closing price prediction [22] and an ANFIS model based on an indirect approach and tested on Tehran Stock Exchange indexes [23]. Rajab & Sharma [24] proposed an interpretable neuro-fuzzy approach to stock price forecasting applied to various exchange series. Neuro-fuzzy prediction models have been combined with genetic algorithms [25], the wavelet transform [26], support vector machines (SVM) [27], and recurrent connections in order to use dynamic memory [28,29]. Such combined models can show high effectiveness, but they suffer from the additional costs of extra clustering procedures, layers, and hyper-parameters.
Generally, neuro-fuzzy systems require many rules to cover complex nonlinear relations. To address this issue in scenarios where input variables show a certain degree of interdependence and correlation, Ebadzadeh & Salimi-Badr [30] used multivariable Gaussian functions in a neuro-fuzzy model for efficient handling of correlated input space regions; however, prediction performance was measured on a chaotic time-series, whose behavior may differ drastically from real-life data, and little attention was paid to handling data noise.
The aforementioned works primarily focused on a single-output scenario, so developing a compact model with a memory-wise architecture and effective learning that can handle complex cross-correlation dependencies between different input time series is a necessary task. The learning procedure should combine generalization and smoothing abilities in order to work on small and noisy datasets.
In this work, we present a novel neuro-fuzzy model and a corresponding learning procedure for time-series forecasting. This model has three advantages: (1) it achieves better accuracy by using the representational abilities of the multivariable Gaussian functions in the fourth layer; (2) it shows good computational performance, which is achieved by using a stochastic gradient optimization procedure for tuning consequent layer units and an iterative projective algorithm for tuning their weights; and (3) it has a reasonable number of hyperparameters to be chosen.
The remainder of this paper is organized as follows: Section 2 describes the dataset that was used for the numerical experiments, Section 3 is devoted to the proposed model architecture and learning, and Section 4 contains the experimental results.

Data Description
In order to verify our model performance, we used the log returns of IBM stock and the S&P 500 index from January 1926 to December 1999, with 888 monthly records, as the dataset. Correspondingly, the dataset had two columns with numerical values: the IBM column represents the log returns of IBM stock, and the SP column contains the corresponding S&P 500 index returns. These two returns comprise a bivariate, concurrently correlated time series [31]. We chose this dataset as a framework due to its well-studied statistical properties and real-life nature. The original dataset can be found in Tsay [31,32], and Table 1 contains a randomly chosen block of 10 lines. Data were normalized in order to compare different models properly. Figure 1 shows the first 300 records of the normalized version of the dataset.
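As an illustration of this preprocessing, the sketch below stacks the two return series into a bivariate dataset, normalizes it, and builds lag-vector input patterns. It is a minimal sketch, not the authors' code: the function names are ours, and z-score normalization is an assumption, since the paper only states that the data were normalized.

```python
import numpy as np

def prepare_series(ibm_returns, sp_returns):
    """Stack two return series into a bivariate dataset and z-score
    normalize each column (an assumed normalization scheme)."""
    data = np.column_stack([ibm_returns, sp_returns]).astype(float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

def make_patterns(data, n_lags):
    """Build input patterns holding an equal number of sequential lag
    values from both components, with the next bivariate value as target."""
    X, y = [], []
    for k in range(n_lags, len(data)):
        X.append(data[k - n_lags:k].ravel())  # lags of both columns
        y.append(data[k])                     # next bivariate value
    return np.array(X), np.array(y)
```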

Architecture and Inference
The proposed model is a multi-output generalization of the model introduced in Vlasenko et al. [33,34], which is in turn based on the classical ANFIS [18] model. It consists of the five layers shown in Figure 2. The main difference from the architecture described in Vlasenko et al. [33] is that the model has m outputs ŷ(k) = (ŷ_1(k), ŷ_2(k), ..., ŷ_m(k))^T instead of just one ŷ for each input pattern x(k) = (x_1(k), x_2(k), ..., x_n(k))^T and, correspondingly, m composite consequent layer units F_a. Figure 2 depicts the modified architecture.
The fuzzification layer consists of h_ψ membership functions for each input variable; hence, their total amount is h_ψ × n. Gaussian membership functions are used; this type of membership function is also known as a radial basis function kernel, popular among a variety of machine learning techniques. It has the following form:

ψ_jl(x_l(k)) = exp(−(x_l(k) − c^ψ_jl)² / (2σ_jl²)),

where x(k) corresponds to the current input vector, c^ψ_jl is the centre, and σ_jl is a variety or width parameter, which controls the width of the bell curve.
On the initialization step, the first-layer membership function centres c^ψ_jl are placed equidistantly and the widths are fixed at σ_jl = 0.15. An example with h_ψ = 5 is shown in Figure 3.
The second layer performs an aggregation of the antecedent premises. It comprises h_ψ algebraic product fuzzy T-norm units g, whose outputs are computed by the formula:

g_j(x(k)) = ∏_{l=1..n} ψ_jl(x_l(k)).

The third layer is non-parametrised and responsible for normalization. It also has h_ψ units, whose outputs are calculated by the following formula:

ḡ_j(x(k)) = g_j(x(k)) / ∑_{i=1..h_ψ} g_i(x(k)).

This fulfils the Ruspini condition, which states that the sum of the membership grades of one element is one [5].
The fourth layer is represented by the consequent units F_a that contain multidimensional Gaussian consequent functions φ_aje(x(k)) and weights p_aje. Multidimensional Gaussian functions replace the standard ANFIS polynomials and have the following form:

φ_aje(x(k)) = exp(−(1/2)(x(k) − c^φ_aje)^T Q_aje^{−1} (x(k) − c^φ_aje)),

where x(k) represents an input vector, c^φ_aje is the vector-centre of the current Gaussian, and Q_aje is the receptive field (covariance) matrix. This function allows handling data distributed unevenly along the main axes. In the case of a diagonal matrix, the multidimensional Gaussian simply represents a collection of independent Gaussians with fixed widths (Figure 4a). The detailed structure of the consequent unit F_a for each output ŷ_a is depicted in Figure 5. The consequent layer output is:

f_aj(x(k)) = ∑_{e=1..h_φ} p_aje φ_aje(x(k)),

where h_φ is the amount of multidimensional Gaussian functions in each unit f_j.
The output layer is non-parametrized and computed as a sum of its inputs:

ŷ_a(k) = ∑_{j=1..h_ψ} ḡ_j(x(k)) f_aj(x(k)).
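The five-layer inference above can be sketched end to end in NumPy. This is an illustrative toy, not the authors' implementation: all inputs share one set of equidistant first-layer centres, the receptive fields are identity matrices, the consequent centres are random, and every name in the snippet is ours.

```python
import numpy as np

# Illustrative dimensions: n inputs, h_psi rules, h_phi Gaussians per
# consequent unit, m outputs.
rng = np.random.default_rng(0)
n, h_psi, h_phi, m = 2, 5, 3, 2

c_psi = np.linspace(-1.0, 1.0, h_psi)              # layer-1 centres (shared per input)
sigma = 0.15                                        # fixed membership width
c_phi = rng.normal(size=(m, h_psi, h_phi, n))       # layer-4 vector-centres
Q_inv = np.tile(np.eye(n), (m, h_psi, h_phi, 1, 1)) # identity receptive fields
p = np.full((m, h_psi, h_phi), 0.1)                 # consequent weights

def forward(x):
    # Layer 1: Gaussian membership of each input in each fuzzy set
    psi = np.exp(-((x[None, :] - c_psi[:, None]) ** 2) / (2 * sigma ** 2))
    # Layer 2: algebraic product T-norm over inputs
    g = psi.prod(axis=1)
    # Layer 3: normalization (Ruspini condition)
    g_bar = g / g.sum()
    # Layer 4: multidimensional Gaussian consequents, weighted sum over e
    d = x - c_phi
    expo = -0.5 * np.einsum('mjen,mjeno,mjeo->mje', d, Q_inv, d)
    f = (p * np.exp(expo)).sum(axis=-1)
    # Layer 5: sum over rules -> m outputs
    return f @ g_bar

y_hat = forward(np.array([0.3, -0.2]))  # a bivariate prediction
```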

Learning Procedure
The proposed learning method generalizes the learning approach proposed by Vlasenko et al. [33] to the multivariate case. It comprises two simultaneous processes: optimization in the weights space P and in the consequent functions space.
During the initialization step, the fourth-layer function centres c^φ_ajl are placed equidistantly and the Q_ajl are initialized as identity matrices. The weights p are all initially set to 0.1.
Weights learning is performed by the Kaczmarz method, which is a form of alternating projection method based on solving linear equation systems. It has gained popularity in signal processing, machine learning, and other domains. Each p_a is updated as follows:

p_a(k+1) = p_a(k) + (y_a(k) − p_a^T(k) f_a(x(k))) f_a(x(k)) / (f_a^T(x(k)) f_a(x(k))),

where p_a(k) is a weights matrix, y_a(k) is a reference signal, and p_a^T f_a(x(k)) is the model output.
Figure 6 shows a simplified example of the error surface in the weights space.
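A single Kaczmarz projection step of this form can be sketched as follows; `kaczmarz_step` and its small `eps` regularizer are our illustrative additions.

```python
import numpy as np

def kaczmarz_step(p_a, f_ax, y_a, eps=1e-12):
    """One Kaczmarz step for one output's weight vector p_a: project p_a
    onto the hyperplane p^T f_a(x(k)) = y_a(k) defined by the current
    input pattern (eps guards against a zero consequent vector)."""
    err = y_a - p_a @ f_ax                # reference signal minus model output
    return p_a + err * f_ax / (f_ax @ f_ax + eps)
```

After the step, the updated weights satisfy the current linear equation, which is what makes the method an alternating projection scheme.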

The first-order stochastic (or online, in terms of machine learning theory) gradient optimization method is used for consequent function tuning. Stochastic gradient methods have been shown to have many advantages over popular batch gradient methods in real-life optimization problems, including machine learning applications [35,36]. Instead of computing the full gradient over the whole dataset in batches, these methods update the free model parameters after processing each input pattern. Their main advantage is the inherent noise of the calculated gradients. Additionally, they do not suffer from the rounding errors that occur when the accumulated gradient has to be stored [37], and they have better computational performance. The disadvantage is that they cannot be parallelized as easily as batch methods [36,37]; the severity of this restriction depends on the domain.
Learning is based on the standard mean square error criterion:

E = (1/N) ∑_{k=1..N} (y(k) − ŷ(k))²,

where y(k) is the reference signal value, ŷ(k) is the prognosis signal value, and N is the training set length. The centres c^φ_ajl are then tuned by a damped gradient step with learning step λ_c, dumping parameter β_c, and τ^c_ajl, a vector of back-propagated gradient values with respect to c^φ_ajl. The Q_ajl matrices are updated analogously, with learning step λ_Q, dumping parameter β_Q, and τ^Q_ajl, a matrix of values back-propagated with respect to Q_ajl.
Vectors η_ac and matrices η_aQ represent the decaying average value of past gradients, and the algorithm initializes with η_ac = η_aQ = 10,000. Figure 7 depicts an example of the error surface with respect to the centres c^φ_ajl, and Figure 8 shows examples of multidimensional Gaussians after learning is finished.
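A stochastic gradient step of the kind described can be sketched as below. Note the hedge: the paper names a learning step, a dumping parameter, and a decaying average of past gradients, which we interpret here as an RMSprop-style update; the exact damped update used in [33] may differ in detail.

```python
import numpy as np

def sgd_step(c, grad, eta, lam=0.05, beta=1e-3, rho=0.9):
    """One stochastic-gradient step for a consequent-function parameter
    (e.g. a Gaussian centre). `eta` is a decaying average of past squared
    gradients used to scale the step, `lam` is the learning step, and
    `beta` is a damping term. This RMSprop-style form is our assumption,
    not the exact update from the paper."""
    eta = rho * eta + (1.0 - rho) * grad ** 2   # decaying gradient average
    c = c - lam * grad / (np.sqrt(eta) + beta)  # damped gradient step
    return c, eta
```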

Experimental Results
We compared the prediction accuracy and computational performance with those of an ANN using bipolar sigmoid activation functions. To train the ANN model, we used the Levenberg-Marquardt algorithm [38] and the resilient backpropagation learning algorithm (RPROP) [39] as popular batch optimization techniques.
The Accord.NET package [40] was used for the ANN implementations and we wrote custom software for the neuro-fuzzy model.In addition, we utilized the Math.NET Numerics package [41] for the linear algebra operations in the neuro-fuzzy model implementation.
The experiments were performed on a computer with an Intel® Core™ i7-7700 processor (Intel, Santa Clara, CA, USA) and 32 GB of memory.
Root Mean Square Error (RMSE) and Symmetric Mean Absolute Percent Error (SMAPE) criteria were used to estimate prediction accuracy:

RMSE = √((1/N) ∑_{k=1..N} (y(k) − ŷ(k))²),

SMAPE = (1/N) ∑_{k=1..N} |y(k) − ŷ(k)| / ((|y(k)| + |ŷ(k)|)/2),

where y(k) is a reference signal value, ŷ(k) is a prognosis signal value, and N is the training set length. The bipolar sigmoid networks were built from sigmoids with an alpha value of 0.4 in all cases; a learning rate of 0.67 and 50 epochs were used for resilient backpropagation. The Levenberg-Marquardt implementation was trained for 10 epochs. The neuro-fuzzy model had the following parameters. The original dataset was divided into a training set of 660 records and a validation set of 228 records. The input vectors for the bivariate case were composed of an equal number of sequential values from both dataset components.
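For reference, both accuracy criteria can be computed as follows (a minimal sketch; the half-sum SMAPE denominator is the common convention, and the paper does not spell out its exact variant):

```python
import numpy as np

def rmse(y, y_hat):
    """Root Mean Square Error over a prediction series."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def smape(y, y_hat):
    """Symmetric Mean Absolute Percent Error with the common
    half-sum-of-absolutes denominator (assumed variant)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))
```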
We also compared the performance of all models in a univariate case, which is reflected in Table 2. The performed experiments showed that the ANN with resilient backpropagation needs a significant number of epochs to attain good accuracy, greatly reducing computational performance. Levenberg-Marquardt learning requires fewer epochs, but each epoch costs more in terms of computing. The proposed model demonstrated better performance results in all cases: both computational time and accuracy were significantly better than those of its competitors. Also, the results in Table 2 show that accuracy improves with a larger h_φ, but the downside is a longer execution time.
The multivariate neuro-fuzzy model requires fewer computational resources than the two independent univariate models and has drastically better accuracy.
Visualization of the learning process is presented in Figures 9 and 10; IBM stock returns are depicted for the multivariate and univariate cases. Figure 9 shows the significant improvement in the forecasting plot obtained using the multivariate neuro-fuzzy model. Conversely, Figure 10 does not show such a degree of improvement for the bipolar sigmoid artificial neural model.

Figure 1 .
Figure 1.First 300 records of the normalized IBM stock and the S&P 500 index dataset.


Figure 2 .
Figure 2. Architecture of the proposed neuro-fuzzy model.


Figure 3 .
Figure 3. Example of the first layer membership functions.

Figure 4 .
Figure 4. Examples of multidimensional Gaussians with the same centre but different covariance matrices: (a) with a diagonal matrix and (b) with a non-diagonal matrix.



Figure 7 .
Figure 7.An example of the error function surface in a simplified two-dimensional case.


Figure 8 .
Figure 8. Examples of the two-dimensional Gaussian functions after learning.


Figure 9 .
Figure 9. Learning process plots for the proposed neuro-fuzzy model: red dots are actual IBM stock return values, blue are predicted values. (a) Bivariate model; (b) univariate model.


Table 1 .
Example records from the IBM stock and the S&P 500 index dataset.
