Forecasting Stock Market Indices Using the Recurrent Neural Network Based Hybrid Models: CNN-LSTM, GRU-CNN, and Ensemble Models

Abstract: Various deep learning techniques have recently been developed in many fields due to the rapid advancement of technology and computing power. These techniques have been widely applied in finance for stock market prediction, portfolio optimization, risk management, and trading strategies. Forecasting stock indices with noisy data is a complex and challenging task, but it plays an important role in the appropriate timing of buying or selling stocks, which is one of the most popular and valuable areas in finance. In this work, we propose novel hybrid models for forecasting the one-time-step and multi-time-step close prices of the DAX, DOW, and S&P500 indices by utilizing recurrent neural network (RNN)-based models: convolutional neural network-long short-term memory (CNN-LSTM), gated recurrent unit (GRU)-CNN, and ensemble models. We propose the average of the high and low prices of stock market indices as a novel feature. The experimental results confirmed that our models outperformed the traditional machine-learning models in 48.1% and 40.7% of the cases in terms of the mean squared error (MSE) and mean absolute error (MAE), respectively, in the case of one-time-step forecasting, and in 81.5% of the cases in terms of both MSE and MAE in the case of multi-time-step forecasting.


Introduction
Forecasting stock market indices, a key task in investment management, is one of the most critical yet challenging problems in finance. Stock market indices are used to formulate and implement economic policy, and they also inform investors' decisions about the timing and size of various investments, such as stocks and real estate.
In finance, stock market forecasting is one of the most challenging tasks due to the inherently volatile, noisy, dynamic, nonlinear, complex, non-parametric, non-stationary, and chaotic nature of stock markets, making any prediction model subject to large errors [1,2]. Additionally, price fluctuations are influenced not only by historical stock trading data, but also by nonlinear factors, such as political factors, investor behavior, and unexpected events [3][4][5][6].
To overcome these difficulties, numerous studies have been conducted over the past decades to predict various types of financial time-series data.
Linear models, such as the autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models, have achieved high accuracy in predicting stock market trends. However, traditional statistical models assume that financial time series are linear, which is not the case in real-world scenarios. Meanwhile, because many machine-learning techniques capture nonlinear relationships in the data [7], they can be very useful for decision-making with respect to financial market investments [8].
• Novel RNN-based hybrid models are proposed to forecast one-time-step and multi-time-step closing prices of the DAX, DOW, and S&P500 indices by utilizing neural network structures: CNN-LSTM, GRU-CNN, and ensemble models.
• The novel feature, the average of the high and low prices of stock market indices, is used as an input feature.
• Comparisons between the proposed and traditional benchmark models with various look-back periods and features are presented.
• The experimental results indicate that the proposed models outperform the benchmark models in 48.1% and 40.7% of the cases in terms of the mean squared error (MSE) and mean absolute error (MAE), respectively, in the case of one-time-step forecasting, and in 81.5% of the cases in terms of both MSE and MAE in the case of multi-time-step forecasting.
• Further, compared with previous studies that used the open, high, and low prices and trading volume of stock market indices as features, we evaluate the performance of our models after adding a novel feature designed to reduce the influence of the highest and lowest prices. The results confirm that the newly proposed feature contributes to improving the performance of the models in forecasting stock market indices.
• In particular, the ensemble model provides significant results for one-time-step forecasting.
The remainder of this paper is organized as follows. Section 2 presents an overview of deep learning models and reviews the relevant existing literature on stock market forecasting. Section 3 describes the proposed models designed using RNN-based hybrid architectures and provides the implementation details of the experiment, including the data and experimental setting. In Section 4, we present the experimental results, where we evaluate the proposed models on three stock market indices, compare them with benchmark models, and analyze the effect of the novel feature. Section 5 discusses the implications and advantages of the proposed models. Finally, Section 6 summarizes the conclusions of the study.

ANN
ANNs, also known as feedforward neural networks, are computing systems inspired by the biological human brain and consist of input, hidden, and output layers with connected neurons, wherein connections between neurons do not form a cycle. An ANN is capable of learning nonlinear functions and processing information in parallel [14]. Each neuron computes the weighted sum of all of its inputs, and a nonlinear activation function is applied to this sum to produce the output result of each neuron. The weights are adjusted to minimize a metric of the difference between the actual and predicted values of the data using the back-propagation algorithm [15].

MLP
The perceptron was proposed by [16], representing an algorithm for the supervised learning of binary classifiers. As a linear classifier, a single-layer perceptron is the simplest feedforward neural network. Minsky and Papert [17] showed that a single-layer perceptron is incapable of learning the exclusive-or (XOR) problem, whereas an MLP is capable of solving it.
An MLP is a fully connected class of ANN. Attempts to solve linearly inseparable problems, such as the XOR problem, have led to different variations in the number of layers and neurons as well as nonlinear activation functions, such as a logistic sigmoid function or a hyperbolic tangent function [18].

CNN
The CNN was proposed to automatically learn spatial hierarchies of features in tasks, such as image recognition and speech recognition [19], by exploiting the spatial relationships among the pixels in an image. In [20], a CNN is composed of convolutional layers, pooling layers, and fully connected layers, and is trained with the adaptive moment estimation (Adam) optimizer on mini batches [21]. The convolutional layers extract the useful features, while the pooling layers reduce the dimensions of the feature maps. The rectified linear unit (ReLU) is applied as a nonlinear activation function [22], and a dropout layer is used as a regularization method in which the output of each hidden neuron is set to zero with a given probability [23].

RNN
The assumption of a traditional neural network is that all the inputs are independent of each other, which makes them ineffective when dealing with sequential data and varied sizes of inputs and outputs [24]. The RNN is an extension of the conventional feedforward neural network and is well suited to sequential data, such as time series, gene sequences, and weather data.
An RNN has memory loops and handles variable-length input sequences by means of a recurrent hidden state [25]. It is known to suffer a significant decrease in learning ability as the gradient gradually shrinks during back-propagation when the distance between the relevant information and the point where it is needed is long, which is called the vanishing gradient problem [26]. Errors from later time steps are difficult to propagate back to earlier time steps, which makes it difficult to train deep RNNs to preserve information over multiple time steps, because the gradients tend to either vanish or explode as they cycle through feedback loops [27]. To address this problem, Hochreiter and Schmidhuber [26] proposed the LSTM, which solves the vanishing gradient problem using memory cells.

LSTM
The LSTM was proposed by [26] as a variant of the vanilla RNN to overcome the vanishing or exploding gradient problem by adding the cell state to the hidden state of an RNN. The LSTM is composed of a cell state and three gates: input, output, and forget gates. The following equations describe the LSTM architecture.
The forget gate $f_t$ determines which information to forget or keep from the previous cell state $C_{t-1}$ and is computed as
$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right),$$
where $x_t$ is the input vector at time $t$, $h_{t-1}$ is the previous hidden state, and $\sigma$ is the logistic sigmoid function. The input gate $i_t$ determines which information is updated to the cell state $C_t$ and is computed by
$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right).$$
The candidate value $\tilde{C}_t$ that can be added to the state is created by a tanh activation function and is computed by
$$\tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right).$$
The cell state $C_t$ can store information over long periods of time by updating the internal state and is computed by
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$
where the operator $\odot$ denotes the element-wise Hadamard product. The output gate $o_t$ determines which information from the cell state is used as the output, by applying the logistic sigmoid activation function:
$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right),$$
and the output $h_t$ is computed as
$$h_t = o_t \odot \tanh(C_t),$$
where $W_*$ and $b_*$ denote the weight matrices and bias vectors, respectively.
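To make the gate interactions concrete, a single LSTM step can be sketched in plain NumPy. The stacked weight layout (one matrix holding the four gate blocks) and the dimensions are illustrative assumptions for exposition, not the implementation used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the four gate weight matrices, each mapping
    the concatenated [h_prev; x_t] vector to the hidden dimension H."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f_t = sigmoid(z[0:H])            # forget gate
    i_t = sigmoid(z[H:2 * H])        # input gate
    c_hat = np.tanh(z[2 * H:3 * H])  # candidate cell state
    o_t = sigmoid(z[3 * H:4 * H])    # output gate
    c_t = f_t * c_prev + i_t * c_hat # updated cell state
    h_t = o_t * np.tanh(c_t)         # new hidden state / output
    return h_t, c_t
```

Because the output gate lies in (0, 1) and tanh is bounded, each component of the hidden state stays strictly inside (−1, 1).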
Following [26], the gates decide which information to be forgotten or to be remembered; therefore, the LSTM is suitable for managing long-term dependencies and forecasting time series with different numbers of time steps. Further, it can generalize and handle noise, distributed representations, and continuous values well.

GRU
The GRU, proposed in [28], is a simpler variation of LSTM and has fewer parameters than LSTM. The LSTM has update, input, forget, and output gates and maintains the internal memory state, whereas the GRU has only update and reset gates. It combines the forget and input gates of LSTM into a single update gate and has fewer tensor operations, resulting in faster training than LSTM.
The GRU merges the cell and hidden states. It performs well in sequence learning tasks and overcomes the problems of vanishing or exploding gradients in vanilla RNNs when learning long-term dependencies [29]. It also tends to perform better than LSTM on fewer training data, whereas LSTM is more efficient in remembering longer sequences [30][31][32]. The following equations describe how memory cells at each hidden layer are updated at each time step [33]. The reset gate $r_t$ controls the influence of $h_{t-1}$ and is computed as
$$r_t = \sigma\left(W_r [h_{t-1}, x_t]\right),$$
where $x_t$ and $h_{t-1}$ are the input and the previous hidden state, respectively. The update gate $z_t$ specifies whether to ignore the current information $x_t$ and is computed as
$$z_t = \sigma\left(W_z [h_{t-1}, x_t]\right).$$
The computation of the candidate activation $\tilde{h}_t$ is similar to that of the traditional recurrent unit, that is,
$$\tilde{h}_t = \tanh\left(W_h [r_t \odot h_{t-1}, x_t]\right),$$
and the hidden state is updated as
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$
where $W_r$, $W_z$, and $W_h$ are the learned weight matrices.
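As with the LSTM, a single GRU update can be sketched in NumPy; the weight shapes are illustrative assumptions, and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU update; each W maps a concatenated [state; input] vector
    to the hidden dimension."""
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))           # reset gate
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))           # update gate
    h_hat = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate
    return (1.0 - z_t) * h_prev + z_t * h_hat                    # new hidden state
```

The new hidden state is a convex combination of the previous state and the candidate, which is what lets the update gate interpolate between keeping and overwriting memory.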

Related Work
A stock market index is an important indicator of changes in the share prices of different companies, thus informing investment decisions. It is also more advantageous to invest in an index fund than to invest in individual stocks because it keeps costs low and removes the need to constantly manage reports from many companies. However, stock market index forecasting is extremely challenging because of the multiple factors affecting the stock market, such as politics, global economic conditions, unexpected events, and the financial performance of companies listed on the stock market.
Recently, deep-learning models have been extensively applied to numerous areas in finance, such as forecasting future stock prices, predicting stock price movements, portfolio management, risk assessment, and trading strategies [34][35][36][37][38][39]. Studies using deep learning-based models, such as CNNs, RNNs, LSTMs, and GRUs, have shown that such models outperform classical methods in time-series forecasting tasks because of their ability to handle nonlinearity [19,25,26,33].
CNN models have been used in different time series forecasting applications. Chen et al. [40] and Sezer and Ozbayoglu [41] transformed time-series data into two-dimensional image data and used them as inputs for a CNN to classify the movement of the data. Meanwhile, Gross et al. [42] interpreted multivariate time series as space-time pictures.
RNN-based models have been used to predict time-series data. Fischer and Krauss [43] showed that LSTM outperformed memory-free classification methods, such as random forests, deep ANNs, and logistic regression classifiers, in prediction tasks. Dutta et al. [44] proposed the GRU model with recurrent dropout to predict the daily cryptocurrency prices.
Other deep learning models have been applied for time series forecasting. Heaton et al. [45] stacked autoencoders to predict and classify stock prices and their movements. Abrishami et al. [7] used a variational autoencoder to remove noise from the data and stacked LSTM to predict the close price of stocks. Wang et al. [2] used wavelet transform to forecast time-series data.
Moreover, various architectures combining deep learning-based models have been proposed in the literature. Ilyas et al. [46] combined technical and content features via learning time series and textual data, Livieris et al. [47] introduced the CNN-LSTM model to predict gold prices and movements, while Daradkeh [6] integrated a CNN and a bidirectional LSTM to predict stock trends. Zhang et al. [4] combined attention and LSTM models for financial time series prediction. Livieris and Pintelas [48] proposed ensemble learning strategies with advanced deep learning models for forecasting cryptocurrency prices and movements. Bao et al. [24] combined wavelet transforms, stacked autoencoders, and LSTM to forecast the closing stock prices for the next day by eliminating noise from the data and generating deep high-level features. Meanwhile, Zhang et al. [49] proposed a novel architecture of a generative adversarial network (GAN) with an MLP as the discriminator and an LSTM as the generator for forecasting the closing price of stocks.
This study proposes three models by combining CNN and RNN-based models for predicting the stock market index. Additionally, in contrast to existing studies, which employed open, high, and low prices, trading volume, and change in stock market indices, we introduce a novel input feature: the average of high and low prices. Furthermore, the three proposed models are evaluated on three daily stock market indices with two different optimizers and four different features. Finally, we compare the performance of the proposed models with conventional benchmark models with respect to forecasting the closing prices of the stock market indices.

Proposed Models
Following Livieris and Pintelas [48], by combining prediction models, a bias is added, which in turn reduces the variance, resulting in a better performance than that of single models. Therefore, we propose three RNN-based hybrid models that predict the stock market indices for one-time-step and multi-time-step at a time.

Proposed CNN-LSTM Model
CNNs can effectively learn the internal representations of time-series data [47]. The one-dimensional convolutional layer filters out noise, extracts spatial features, and reduces the number of parameters. The causal convolution ensures that the output at time t depends only on inputs from time t and earlier, preventing leakage from future time steps. RNNs are considered the best sequential deep-learning models for forecasting time-series data. To this end, we combine a one-dimensional CNN and an LSTM into a new model: CNN-LSTM. The CNN-LSTM model consists of (1) a one-dimensional convolutional layer, (2) an LSTM layer, (3) a batch-normalization layer, (4) a dropout layer, and (5) a dense layer.
To determine the best-performing parameters, we examined different variants of the model: the number of hidden layers (1 and 2), the number of neurons (64 and 128), the batch size (32 and 64), and the dropout rate (0.2 and 0.5).
The best-performing CNN-LSTM model comprised a one-dimensional convolutional layer with 32 filters of size 3, a stride of 1, causal padding, and the ReLU activation function; an LSTM layer with 128 units and the tanh activation function; a batch-normalization layer; a dropout layer with a rate of 0.2; and a dense layer whose number of units equals the prediction window size, with the ReLU activation function. Figure 1 illustrates the architecture of the proposed CNN-LSTM model, while Table 1 summarizes the configuration.
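Assuming Keras, as used in our implementation, the described stack could be sketched as follows; the default input shape (a 42-day look-back over four OHLV features) and the one-step prediction window are illustrative choices, not part of the specification above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_lstm(look_back=42, n_features=4, pred_window=1):
    """CNN-LSTM sketch: Conv1D -> LSTM -> BatchNorm -> Dropout -> Dense."""
    model = keras.Sequential([
        keras.Input(shape=(look_back, n_features)),
        # causal 1-D convolution: output at time t sees only inputs up to t
        layers.Conv1D(32, kernel_size=3, strides=1,
                      padding="causal", activation="relu"),
        layers.LSTM(128, activation="tanh"),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        # one output unit per predicted time step
        layers.Dense(pred_window, activation="relu"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=5e-4),
                  loss=keras.losses.Huber())
    return model
```

The same builder serves one-time-step (`pred_window=1`) and multi-time-step (`pred_window=5`) forecasting by changing the size of the final dense layer.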

Proposed GRU-CNN Model
The GRU is simpler than the LSTM, has the ability to learn sequential patterns, and takes less time to train while improving network performance. To utilize both the GRU and the one-dimensional CNN, we propose a stacked architecture in which a GRU and a one-dimensional CNN are combined, namely the GRU-CNN model. The parameters used for the GRU-CNN model were similar to those of the CNN-LSTM model, as described in Section 3.1.1. The difference between the CNN-LSTM and GRU-CNN models lies in the order in which the RNN and CNN layers are stacked.
The GRU-CNN model consists of a GRU layer with 128 units and the tanh activation function; a one-dimensional convolutional layer with 32 filters of size 3, a stride of 1, causal padding, and the ReLU activation function; a one-dimensional global max-pooling layer; a batch-normalization layer; a dense layer with 10 units and the ReLU activation function; a dropout layer with a rate of 0.2; and a dense layer whose number of units equals the prediction window size, with the ReLU activation function. In the GRU-CNN model, the GRU layer returns a sequence, and the one-dimensional global max-pooling layer retains only the important features and reduces the dimension of the feature map. The architecture of the proposed GRU-CNN model is illustrated in Figure 1, while the configuration is listed in Table 1.

Proposed Ensemble Model
While evaluating the performance of the benchmark models, various RNN models, such as the RNN, LSTM, and GRU, exhibited high predictive performance on different types of datasets. There are three widely employed ensemble learning strategies: ensemble averaging, bagging, and stacking. Based on the results of the benchmarks, the CNN-LSTM, and the GRU-CNN as implemented above, we propose an average ensemble of three RNN-based models to achieve high average performance across various datasets. The proposed ensemble model can utilize the representations of the RNN, LSTM, and GRU models. The parameters used for the ensemble model were similar to those of the CNN-LSTM and GRU-CNN models, as described in Section 3.1.1.
The ensemble model consists of an RNN layer with 128 units and the tanh activation function; an LSTM layer with 128 units and the tanh activation function; and a GRU layer with 128 units and the tanh activation function; followed by the average of all the hidden states from the RNN, LSTM, and GRU; a dropout layer with a rate of 0.2; a dense layer with 32 units and the ReLU activation function; and a dense layer whose number of units equals the prediction window size, with the ReLU activation function. Figure 1 illustrates the details of each layer of the proposed ensemble model, while Table 1 presents the configuration.
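Reading the description as three parallel recurrent branches whose final hidden states are averaged, a hedged Keras sketch might look like the following; the branch-parallel wiring and the default input shape are our assumptions for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_ensemble(look_back=42, n_features=4, pred_window=1):
    """Average-ensemble sketch: SimpleRNN, LSTM, and GRU branches
    share one input; their final hidden states are averaged."""
    inp = keras.Input(shape=(look_back, n_features))
    h_rnn = layers.SimpleRNN(128, activation="tanh")(inp)
    h_lstm = layers.LSTM(128, activation="tanh")(inp)
    h_gru = layers.GRU(128, activation="tanh")(inp)
    # ensemble averaging of the three hidden-state vectors
    h = layers.Average()([h_rnn, h_lstm, h_gru])
    h = layers.Dropout(0.2)(h)
    h = layers.Dense(32, activation="relu")(h)
    out = layers.Dense(pred_window, activation="relu")(h)
    model = keras.Model(inp, out)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=5e-4),
                  loss=keras.losses.Huber())
    return model
```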

Implementation Details
In this subsection, we present an extensive empirical analysis of the proposed models on three datasets. First, we describe the datasets and the experimental setting used to demonstrate the validity of our financial time-series prediction models. Next, we evaluate the performance of our models on several datasets and compare them with those of conventional deep learning models.

Dataset
We evaluated the performance of the proposed models on daily stock market indices to verify the robustness of our models. We considered three stock market indices from major stock markets listed below. The historical prices of each stock market index were obtained using the Finance-DataReader open-source library available in the pandas DataReader module of the Python programming language [56]. The raw data included six features: open, high, low, and close prices, trading volume, and change. The incomplete data were removed.
Before feeding the raw data into our models, we pre-processed the data. We normalized the raw data using Scikit-learn's MinMaxScaler tool, as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x$ is the input feature of the stock market index and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of each input feature, respectively. Granger [57] suggested holding approximately 20% of the data for out-of-sample testing. Following this suggestion, the first 80% of the data were used as the training set for in-sample training, while the remaining 20% were used as the test set, to ensure that our models were evaluated on unseen out-of-sample data. The first 90% of the training set was used to train the network and to iteratively adjust its parameters such that the loss function was minimized. The trained network predicted the remaining 10% for validation, and the validation loss was computed after each epoch.
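The scaling and the chronological 80/20 (with 90/10 train/validation) split can be reproduced in plain NumPy, mirroring the behavior of Scikit-learn's MinMaxScaler; fitting the scaler on the training portion only is the standard way to avoid test-set leakage.

```python
import numpy as np

def minmax_scale(train, test):
    """Fit min-max statistics on the training set only, then apply to
    both splits (mirrors Scikit-learn's MinMaxScaler behavior)."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (train - lo) / scale, (test - lo) / scale

def chrono_split(data, train_frac=0.8, val_frac=0.1):
    """First 80% train / last 20% test; last 10% of train as validation."""
    n_train = int(len(data) * train_frac)
    train, test = data[:n_train], data[n_train:]
    n_fit = int(len(train) * (1 - val_frac))
    return train[:n_fit], train[n_fit:], test
```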

Generation of the Inputs and Outputs Using the Sliding Window Technique
This subsection describes the generation of the inputs and outputs. The daily open, high, and low prices, trading volume, and change were commonly used as input features in other studies. However, in the current study, we introduce a novel feature named medium, which is the average of high and low prices, to reduce the influence of the unusually extreme highest and lowest prices and to ensure generalizability.
For each stock market index, the partial features of the daily open, high, low, and medium prices, trading volume, and change (OHLMVC) were used as the input to train the model, and the daily close prices were used as the output to predict one time step and multiple time steps ahead.
For the input and output generation, the normalized data were segmented using the sliding window technique, by which a fixed window size of time-series data was chosen as the input and a fixed number of the following observations was chosen as the output. This process was repeated over the entire dataset by sliding the window in intervals of one time step to obtain the next input and output. We trained the proposed models to look at $m$ consecutive past data points of the features. The input at time $t$ was denoted by
$$X_t = \left(x^k_{t-m+1}, \ldots, x^k_t\right), \quad k \in \{O, H, L, M, V, Ch\},$$
where $x^O$, $x^H$, $x^L$, $x^M$, $x^V$, and $x^{Ch}$ are the daily open, high, low, and medium prices, trading volume, and change from time $t-m+1$ to time $t$, respectively. The input $X_t$ was fed sequentially into the proposed models to predict the following $n$ daily close prices of the stock market indices, with the output denoted by
$$Y_t = \left(y_{t+1}, \ldots, y_{t+n}\right).$$
The look-back periods of 5, 21, and 42 days were considered as one week, one month, and two months, respectively, while the look-ahead periods of one and five days were considered to predict one time step or multiple time steps ahead. Figure 2 illustrates the sliding window technique.
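The sliding window generation described above can be sketched as a short NumPy function; the argument names `m` (look-back) and `n` (look-ahead) follow the notation in the text.

```python
import numpy as np

def sliding_windows(features, close, m=5, n=1):
    """Build (X, Y) pairs: X[i] holds m consecutive rows of the feature
    matrix, Y[i] the following n close prices, sliding one step at a time."""
    X, Y = [], []
    for t in range(m, len(features) - n + 1):
        X.append(features[t - m:t])  # look-back window of m time steps
        Y.append(close[t:t + n])     # next n close prices
    return np.array(X), np.array(Y)
```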

Software and Hardware
The proposed models were implemented, trained, and analyzed in Python 3.7.6 [58] with the Keras library 2.4.3 [59] as a high-level neural network API using TensorFlow 2.3.1 as back-end [60], relying on NumPy 1.19.2 [61], Pandas 0.25.3 [56], and Scikit-learn 1.0.2 [62]. The code used for producing the figures and analysis is available on GitHub at https://github.com/hyunsunsong/Project. All experiments were performed using a workstation equipped with an Intel Xeon Silver 4208 CPU at 2.10 GHz x8, Nvidia GPU TITAN, and 12 GB RAM on each board.

Experimental Setting
The proposed models were trained with the Huber loss function, which combines the characteristics of MSE and MAE and is less susceptible to outliers in the data than the MSE loss function [63]. It behaves quadratically for small residuals and linearly for large residuals [64]. The parameters of the network were learned to minimize the average of the Huber loss function over the entire training dataset.
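The quadratic-for-small, linear-for-large behavior of the Huber loss can be written out directly; `delta=1.0` is the common default threshold, used here for illustration.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: 0.5*r^2 for |r| <= delta, delta*|r| - 0.5*delta^2 beyond,
    so large residuals (outliers) are penalized only linearly."""
    r = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    quad = 0.5 * r ** 2
    lin = delta * r - 0.5 * delta ** 2
    return np.mean(np.where(r <= delta, quad, lin))
```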
The network weights and biases were initialized with the Glorot-Xavier uniform method and zeros, respectively. Glorot and Bengio [65] proposed the Glorot-Xavier uniform method to adopt a properly scaled uniform distribution for initialization.
The successful application of neural networks requires regularization [66]. Introduced by [23], the dropout regularization technique randomly drops a fraction of the units, along with their connections, with a specified probability during training, while all units are present during testing. We applied dropout rates of 0.2 and 0.5 to reduce overfitting and observed that the higher dropout rate resulted in a decline in performance. Therefore, we settled on the relatively low dropout rate of 0.2, as studied in [67].
The batch size and maximum number of epochs were set to 32 and 50, respectively, and an early stopping patience of 10 was applied [68]. That is, once the validation loss no longer decreased for the patience period, the training was stopped, and the weights of the model with the lowest validation loss were restored using ModelCheckpoint callback in the Keras library [59].
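The early-stopping and checkpointing setup described above can be expressed with standard Keras callbacks; the checkpoint file path is an illustrative placeholder.

```python
from tensorflow import keras

def training_callbacks(checkpoint_path="best_weights.h5"):
    """Stop training once validation loss has not improved for 10 epochs,
    and keep the weights of the best epoch on disk."""
    return [
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                      restore_best_weights=True),
        keras.callbacks.ModelCheckpoint(checkpoint_path, monitor="val_loss",
                                        save_best_only=True),
    ]
```

The list would be passed to `model.fit(..., callbacks=training_callbacks())` together with the validation split described above.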
The optimization algorithms used for training were the Adam and root mean square propagation (RMSProp) [69], which are adaptive learning rate methods. The RMSProp is usually a viable choice for RNNs [59]. We compared the performance of the proposed models using two different optimizers.
We applied learning rates of 0.001 and 0.0005 and found that a learning rate of 0.0005 resulted in a better performance. Therefore, the learning rate was set to 0.0005.
The ReLU activation function proposed in [22] was used for the dense layers, and the data shuffling technique was not used during training.

Predictive Performance Metrics
In this study, we adopted the MSE and MAE as evaluation metrics to compare the performance of the proposed models with that of conventional benchmark models for forecasting time-series data, calculated as follows:
$$\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left(y_t - \hat{y}_t\right)^2, \qquad \mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left|y_t - \hat{y}_t\right|,$$
where $T$ is the number of prediction time horizons, and $y_t$ and $\hat{y}_t$ are the true and predicted values, respectively, during one-time-step prediction. During multi-time-step prediction, we only used the value of the last time step; thus, $y_t$ and $\hat{y}_t$ represent the true and predicted values of the last time step, respectively.
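Both metrics are one-liners in NumPy, shown here for completeness:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over the prediction horizon."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error over the prediction horizon."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```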

Experimental Results
In this section, we present the experimental results of the proposed models using historical time-series data for three stock market indices: DAX, DOW, and S&P500. We first describe the details of the benchmark models used for comparison. Second, we compare the results for the proposed models and conventional benchmarks with respect to one-time-step and multi-time-step predictions on three datasets over three different periods. Third, we present the results of the impact of different features and optimizers on the performance of the proposed models.

Benchmark Models
For benchmark comparison, we deploy several conventional deep learning models, such as RNN, LSTM, and GRU, to examine whether the proposed models outperform the benchmarks. In addition, we utilize WaveNet, which combines causal filters with dilated convolutions, so that the model learns long-range temporal dependencies in time-series data [70]. The benchmark models and corresponding architectures are listed below.

1. RNN: two RNN layers with 128 units each and a dense layer whose number of units equals the look-ahead period;
2. LSTM: an LSTM layer with 128 units and a dense layer whose number of units equals the look-ahead period;
3. GRU: a GRU layer with 128 units and a dense layer whose number of units equals the look-ahead period;
4. WaveNet: a simpler variant of the audio generative model based on PixelCNN [71], as described in [70].

Table 2 lists the training settings for the benchmark models. All benchmark models were trained with 50 epochs, an early stopping patience of 10, a learning rate of 0.0005, a batch size of 32, the MSE loss function, the Adam optimizer, and the ReLU activation function.

One-Time-Step Prediction Comparisons between Proposed and Benchmark Models
In this subsection, we provide the experimental results of the proposed models to predict the one-time-step ahead of the three stock market indices. We evaluated the performance of the proposed models with various look-back periods of 5, 21, and 42 days as one week, one month, and two months, respectively, for different periods. The proposed and benchmark models were implemented as described in previous sections. The Adam optimizer and OHLV features were used for all methods in Table 3.  Table 3 compares our models and the benchmark models for the different look-back periods for one-time-step prediction, where the best performance results are marked in bold for each stock market index, period, and metric.
According to the results in Table 3, increasing the look-back period slightly enhances the performance across all operating conditions when all other hyperparameters are kept constant. Moreover, a very long sequence length, such as the look-back period of 42 days, increases the performance. From 1 January 2000 through 31 December 2019, the proposed models outperformed the benchmarks in 77.8% of cases in terms of both MSE and MAE. Additionally, our models outperformed the benchmarks in 66.7% and 77.8% of cases in terms of MSE and MAE, respectively, for the period from 1 January 2017 through 31 December 2019, before the COVID-19 pandemic, and in 100% of cases, in terms of both MSE and MAE, for the period from 1 January 2019 through 31 December 2021, covering the COVID-19 pandemic. Some results were identical across different benchmarks because the training algorithm may converge to local optima. Further, an overall comparison between the ensemble model and the other models in Table 3 indicates that the ensemble model significantly outperformed the others.
We evaluated the performance of the proposed models with four different features (i.e., MV, MVC, OHLV, and OHLMVC) in addition to a novel feature, medium, defined as the average of high and low prices. In comparison, OHLVs have been commonly used as features in other studies.
In addition, we evaluated the performance of our models using two different optimizers, Adam and RMSProp, by keeping all other hyperparameters constant. The average MSE and MAE over the three periods for the impact of different features and optimizers of the proposed models are shown in Tables 4 and 5, where the best performance results are marked in bold for each stock market index, a look-back period, and optimizer. Regarding the one-time-step prediction, Table 4 shows that the CNN-LSTM, GRU-CNN, and ensemble models with the novel medium feature outperformed the other models in 83.3%, 33.3%, and 0% of cases with the DAX dataset; 83.3%, 100%, and 16.7% of cases with the DOW dataset; and 83.3%, 83.3%, and 33.3% of cases with the S&P500 dataset, respectively, in terms of the average MSE over the three periods. Table 5 shows that the CNN-LSTM, GRU-CNN, and ensemble models incorporating the medium feature outperformed the other models in 83.3%, 33.3%, and 16.7% of cases with the DAX dataset; 83.3%, 100%, and 16.7% of cases with the DOW dataset; and 66.7%, 66.7%, and 33.3% of cases with the S&P500 dataset, respectively, in terms of the average MAE over the three periods.
An overall comparison between the models incorporating the medium feature and the models without the medium feature shows that adding the medium feature improves the performances of all models. In addition, the proposed models were trained for 1500 epochs with the RMSProp optimizer and MV features to achieve higher performance than that of the model trained as described in Section 3.2.4. Figures 3-5 compare the actual and predicted close prices of the DAX, DOW, and S&P500 indices, respectively, for the different look-back periods. In Figures 3-5, the left, middle, and right plots correspond to the look-back periods of 5, 21, and 42 days, respectively. Further, the look-back period and stock market index evidently affect the model performance.

Multi-Time-Step Prediction Comparisons between Proposed and Benchmark Models
In this subsection, we evaluate the performance of the proposed models with various look-back periods and provide the experimental results for multi-time-step-ahead prediction of the three stock market indices. Look-back periods of 5, 21, and 42 days and a look-ahead period of five days were adopted for each period. The proposed and benchmark models were implemented as described in the previous sections. The Adam optimizer and OHLV features were used for all methods in Table 6. Table 6 compares the proposed and benchmark models for different look-back periods in five-time-step prediction, where the best performance results are marked in bold for each stock market index, period, and metric.
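The multi-time-step setting above can be illustrated with a minimal sliding-window construction (the function name and the synthetic price series are assumptions for illustration): each sample maps a look-back window of past closing prices to a look-ahead window of future ones.

```python
import numpy as np

def make_windows(series, look_back, look_ahead):
    """Slice a 1-D series into (input window, target window) pairs.

    Each sample uses `look_back` past values to predict the next
    `look_ahead` values, mirroring the multi-time-step setting.
    """
    X, y = [], []
    for i in range(len(series) - look_back - look_ahead + 1):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back:i + look_back + look_ahead])
    return np.array(X), np.array(y)

# 30 synthetic closing prices; look back 5 days, predict 5 days ahead
prices = np.arange(30, dtype=float)
X, y = make_windows(prices, look_back=5, look_ahead=5)
print(X.shape, y.shape)  # (21, 5) (21, 5)
```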
From the table, the proposed models outperformed the benchmarks in 66.7% and 66.7% of cases for the period from 1 January 2000 through 31 December 2019; in 22.2% and 11.1% of cases for the period from 1 January 2017 through 31 December 2019, before the COVID-19 pandemic; and in 55.6% and 55.6% of cases for the period from 1 January 2019 through 31 December 2021, after the COVID-19 pandemic in terms of MSE and MAE, respectively.
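The MSE and MAE used throughout these comparisons can be sketched as follows (the helper names and toy values are illustrative, not taken from the paper's results):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical actual vs. predicted close prices
actual = [100.0, 101.0, 99.0]
predicted = [101.0, 100.0, 99.0]
print(mse(actual, predicted), mae(actual, predicted))
```

The MSE penalizes large residuals more heavily than the MAE, which is why both are reported side by side throughout the comparisons.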
For long-term predictions, the MSE and MAE were not as good as for short-term predictions. Specifically, the errors grew with the number of prediction steps, highlighting that long-term predictions are more challenging than short-term ones. Nonetheless, the ensemble model still outperformed the conventional benchmark models in long-term predictions.

Table 6. Comparison of five-time-step prediction between proposed and benchmark models.

We evaluated the performance of the proposed models with four different features and two different optimizers.
The average MSE and MAE for the different features and optimizers of the proposed models over the three periods are shown in Tables 7 and 8, where the best performance results are marked in bold for each stock market index, look-back period, and optimizer.
For multi-time-step prediction, in terms of the average MSE over the three periods, Table 7 confirms that the CNN-LSTM, GRU-CNN, and ensemble models with the introduced medium feature outperformed the other models in 66.7%, 83.3%, and 83.3% of cases with the DAX dataset; 66.7%, 66.7%, and 50% of cases with the DOW dataset; and 66.7%, 50%, and 83.3% of cases with the S&P500 dataset.
Further, for multi-time-step prediction, in terms of the average MAE over the three periods, Table 8 shows that the CNN-LSTM, GRU-CNN, and ensemble models with the novel medium feature outperformed the other models in 83.3%, 100%, and 83.3% of cases with the DAX dataset; 66.7%, 100%, and 66.7% of cases with the DOW dataset; and 66.7%, 50%, and 83.3% of cases with the S&P500 dataset. In addition, the proposed models were trained for 1500 epochs with the Adam optimizer and OHLV features to achieve higher performance than that of the model trained as described in Section 3.2.4. Figures 9-11 compare the actual and predicted close prices of the DAX, DOW, and S&P500 indices, respectively, for the different look-back periods. In Figures 9-11, the left, middle, and right plots correspond to look-back periods of 5, 21, and 42 days, respectively. The look-back period and stock market index also evidently affect the model performance. Moreover, the proposed models were trained for 1500 epochs with the Adam optimizer and a look-back period of 5 days; comparisons of the actual and predicted close prices of the DAX, DOW, and S&P500 indices between different input features for five-time-step prediction are provided in Figures 12-14, respectively.

Discussion
Various deep-learning techniques have been applied extensively in finance for stock market prediction, portfolio optimization, risk management, and trading strategies. Although forecasting stock market indices with noisy data is a complex and challenging task, it plays an important role in helping investors appropriately time the buying or selling of investment assets and thereby reduce risk, making it one of the most valuable areas in finance.
Combining multiple deep-learning models can result in better performance [48]. We proposed three RNN-based hybrid models, namely the CNN-LSTM, GRU-CNN, and ensemble models. The proposed models were evaluated on forecasting the one-time-step and multi-time-step closing prices of stock market indices across various stock market indices, look-back periods, optimizers, features, and learning rates.
The experimental results revealed that the proposed models, which combine variants of RNNs, outperformed the traditional machine-learning models, such as the RNN, LSTM, GRU, and WaveNet, in most cases. In particular, the ensemble model produced significant results for one-time-step forecasting. Moreover, compared with previous studies that used the open, high, and low prices and trading volume of stock market indices as features, the performance of our models improved by incorporating the proposed novel feature, the average of the high and low prices. Furthermore, our models with MV features provided favorable results in numerous cases; notably, reducing the number of features can be interpreted as mitigating overfitting.
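This excerpt does not spell out how the ensemble combines its member models, but one minimal combination rule, assumed here purely for illustration (and not necessarily the paper's exact scheme), is to average the predictions of the two hybrid models:

```python
import numpy as np

# Hypothetical per-day close-price predictions from the two hybrid models
cnn_lstm_pred = np.array([13500.0, 13520.0, 13480.0])
gru_cnn_pred = np.array([13460.0, 13540.0, 13500.0])

# Simple averaging ensemble: each member contributes equally, which
# tends to cancel the members' uncorrelated errors.
ensemble_pred = (cnn_lstm_pred + gru_cnn_pred) / 2.0
print(ensemble_pred)  # [13480. 13530. 13490.]
```

Weighted averaging or a learned combiner (stacking) are common alternatives when one member is consistently stronger.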
The performance of the proposed and benchmark models with the Adam optimizer and OHLV features over the three periods was evaluated for one-time-step and five-time-step prediction using look-back periods of 5, 21, and 42 days, as provided in Tables 3 and 6, respectively. Comparisons of the average MSE and MAE over the three periods for different look-back and look-ahead periods are provided in Figures 15 and 16, respectively. An overall comparison between the ensemble model and the other models in Figures 15 and 16 indicates that the ensemble model significantly outperformed the other models. In addition, the performance of the proposed and benchmark models over the three periods was evaluated to compare the impact of the four different input features (i.e., MV, MVC, OHLV, and OHLMVC) for one-time-step and five-time-step predictions with three look-back periods and two optimizers, as described in Section 3.2. Comparisons of the average MSE and MAE of the proposed and benchmark models over all periods, optimizers, and look-back and look-ahead periods are provided in Figures 17 and 18, respectively. The proposed models outperformed the benchmark models, and the performance of our models improved by incorporating the proposed medium feature.

During the course of this study, the Russia-Ukraine crisis escalated on 24 February 2022. Additional experiments were conducted to examine the impact of this crisis on each stock market index for the period from 1 January 2021 through 15 February 2023.
We evaluated the performance of the proposed and benchmark models for one-time-step and five-time-step-ahead prediction with look-back periods of 5, 21, and 42 days, corresponding to one week, one month, and two months, respectively. The architectures of the proposed and benchmark models have been described in Sections 3.1 and 4.1, respectively.
The proposed and benchmark models were implemented with 50 epochs, an early-stopping patience of 10, a batch size of 32, a learning rate of 0.0005, the Adam optimizer, the ReLU activation function, and OHLV features. The network weights and biases were initialized with the Glorot (Xavier) uniform method and zeros, respectively. The proposed and benchmark models were trained with the Huber loss function and the MSE loss function, respectively.

Table 9. Comparison of one-time-step and five-time-step predictions between proposed and benchmark models for the period from 1 January 2021 through 15 February 2023.

Table 9 compares our models with the benchmark models for the different look-back periods for one-time-step and five-time-step predictions, where the best performance results are marked in bold for each stock market index, period, and metric. It indicates that the proposed models improved on the benchmarks in several cases and that the ensemble model significantly outperformed the other models. Furthermore, the proposed framework can be applied to forecasting time-series data in other fields, such as energy consumption, oil prices, gas concentrations, air quality, and river flow. Moreover, forecasting performance could be further improved in future studies by combining different types of RNN-based models and by constructing portfolios using the predicted stock market prices.
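The Huber loss used to train the proposed models is quadratic for small residuals and linear for large ones, blending the smoothness of the MSE with the outlier robustness of the MAE, which suits noisy price series. A minimal numpy sketch (delta=1.0 is an assumed default; the paper's exact setting is not stated in this excerpt):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: 0.5*e^2 for |e| <= delta, delta*(|e| - 0.5*delta) beyond."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    small = np.abs(error) <= delta
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return float(np.mean(np.where(small, quadratic, linear)))

print(huber_loss([0.0], [0.5]))  # 0.125 (quadratic regime: 0.5 * 0.5**2)
print(huber_loss([0.0], [3.0]))  # 2.5   (linear regime: 1.0 * (3.0 - 0.5))
```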

Conclusions
In this paper, we proposed three RNN-based hybrid models, namely the CNN-LSTM, GRU-CNN, and ensemble models, to make one-time-step and multi-time-step predictions of the closing prices of three stock market indices in different financial markets. We evaluated and compared the performance of the proposed models with conventional benchmarks (i.e., RNN, LSTM, GRU, and WaveNet) over three different periods: a long period of more than 15 years and two short periods of three years before and after the COVID-19 pandemic. The proposed models significantly outperformed the benchmark models by achieving high predictive performance for various sizes of look-back and look-ahead periods in terms of MSE and MAE. Moreover, we found that the proposed ensemble model was comparable to the GRU, which performed well among the benchmarks, and outperformed the benchmarks in many cases.
Additionally, we introduced a novel feature, medium, which is the average of high and low prices, and evaluated the performance of the proposed models with four different features and two different optimizers. The results indicated that incorporating the novel feature improved model performance. Overall, our experiments verified that the proposed models outperformed the benchmark models in many cases and that incorporating the medium feature improved their performance.