DERN: Deep Ensemble Learning Model for Short- and Long-Term Prediction of Baltic Dry Index

The Baltic Dry Index (BDI) is a commonly utilized indicator of global shipping and trade activity. It influences stakeholders' and ship-owners' decisions respecting investments, chartering, operational plans, and export and import activities. Accurate prediction of the BDI is very challenging due to its volatility, non-stationarity, and complexity. To help stakeholders and ship-owners make sound short- and long-term maritime business decisions and avoid market risk, we performed short- and long-term predictions of the BDI using an ensemble deep-learning approach. In this study, we propose to apply recurrent neural network models to BDI prediction. State-of-the-art sequential deep-learning models, namely RNN, LSTM, and GRU, are employed to predict one- and multi-step-ahead BDI values. To increase accuracy, we combine these models into an ensemble. In experiments, we compared our results with those of traditional methods such as ARIMA and MLP. The results showed that our proposed method outperforms ARIMA, MLP, RNN, LSTM, and GRU in both short- and long-term prediction of the BDI.


Introduction
The Baltic Dry Index (BDI) is a freight index created by the London-based Baltic Exchange. This index indicates shipment costs for dry bulk cargoes consisting of commodities such as grain, coal, iron ore, and copper. The BDI is a composite of three sub-indices, namely Capesize, Panamax, and Supramax. These sub-indices correspond to different bulk-carrier capacities: 180,000, 74,000, and 58,000 dwt, respectively. The BDI has been widely used as a world-trade economic indicator [1]. Many stakeholders make serious efforts to forecast it precisely so as to be able to make smart investment and trading decisions. However, the volatility, non-stationarity, and complexity of the BDI are known to be more intractable than those of stock prices. Predicting BDI values is therefore a challenging task.
The BDI is regarded as a barometer not only of the shipping industry and international trade, but also of the global economy [2]. Investors, speculators, and researchers have long found it to be useful, theoretically challenging, and relevant when projecting future profits. However, because many managerial decisions are based on future prospects, forecasting accuracy is essential for organizations and companies in order to avoid market risk. Recent advances in both analytical and computational methods have resulted in a number of new ways of mining freight-index time-series data.
Ship-owners, stakeholders, and investors need to be concerned with not only short-term prediction of time-series data but also long-term prediction. Predicting a long-term sequence of time-series data is more difficult than short-term prediction [3]. For example, in making a decision on a vessel, there are multiple options available to the vessel's owners. If the BDI trend is increasing,

Background
In this section, we present the background on our BDI-related research and discuss some time-series prediction models available in the literature.

Related Works
Cullinane et al. [4] were the pioneers in conducting research on BDI prediction using the ARIMA model. In the past several years, there has been some research done on BDI prediction. Cho and Lin used a fuzzy neural network model to analyze and forecast the BDI [5], and Kamal et al. [6] forecast the BDI as a high-dimensional multivariate regression problem by using deep neural networks. Sahin et al. [7] predicted one-step-ahead BDI values with three proposed artificial neural networks, specifically a univariate model and two bivariate models, by harnessing historical BDI data and the world price of crude oil. Qingcheng et al. [8] proposed a decomposition technique for BDI data, and then used a neural network for prediction. Zhang et al. [9] compared econometric models such as ARIMA and GARCH with artificial neural network models such as BPNN, RBFNN, and ELM. The majority of the previous research has treated BDI prediction as a regression and short-term prediction task. In the present study, by contrast, we conducted research on both short- and long-term prediction of the BDI to facilitate ship-owners' short- and long-term decision-making. In addition, in terms of models, the majority of the previous studies have harnessed artificial neural network models and statistical models. Note that, in terms of sequential learning for prediction of time-series data, the recurrent neural network is the state-of-the-art method. Moreover, nowadays, deep learning with a deep architecture is a promising approach for accurate prediction.
Over the course of the past few decades, there have been many outstanding approaches to the prediction of time-series data, such as ARIMA [10], the Support Vector Regressor (SVR) [11], fuzzy systems [12], and deep learning [13][14][15][16]. Nonetheless, by those methods alone, accurate prediction of real data is unobtainable, since real time-series data is commonly volatile and non-stationary. For enhanced prediction, some researchers have proposed data transformation [17], decomposition [18][19][20], and even ensemble methods [21,22]. In the present study, we harnessed and combined deep-learning approaches, including a deep recurrent neural network (Deep RNN), a long short-term memory network (LSTM), and a gated recurrent unit network (GRU), to obtain more accurate prediction results.

Sequential Model
Recurrent Neural Network (RNN) is a type of neural network with loops that allow for retention of information from the past. Specifically, the loops enable the RNN to use information from past time slices to produce output for the current time slice t. Thus, the decision made at time slice t − 1 affects the decision to be made at t, and the response of the network to new data depends on the current input as well as the output from recent past data. The RNN output is obtained by iterative calculation of the following two equations:

h_t = H(W_xh x_t + W_hh h_(t−1) + b_h) (1)
y_t = W_hy h_t + b_y (2)

In Equations (1) and (2), x_t is the input sequence, y_t is the output sequence, and h_t represents the hidden vector sequence at time slice t (t = 1, 2, ..., T); W and b represent weight matrices and biases, respectively; and lastly, H is the activation function used for the hidden layer. The back-propagation through time (BPTT) technique is usually used to train RNNs [23]. However, it is difficult to use BPTT to train traditional RNNs due to the gradient-vanishing and exploding problem [24]: errors from later time steps are difficult to propagate back to earlier time steps for proper updates of the network parameters. To address this problem, the long short-term memory (LSTM) unit was developed [25].
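To make Equations (1) and (2) concrete, one RNN forward step can be sketched in NumPy as follows; the weight names, shapes, and toy input sequence are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One forward step of a vanilla RNN, following Equations (1) and (2)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # hidden state; H = tanh
    y_t = W_hy @ h_t + b_y                           # output at time slice t
    return h_t, y_t

# Toy configuration: 1-dimensional input, 3 hidden units.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 1))
W_hh = rng.normal(size=(3, 3))
W_hy = rng.normal(size=(1, 3))
b_h, b_y = np.zeros(3), np.zeros(1)

h = np.zeros(3)                        # initial hidden state
for x in [0.1, -0.2, 0.3]:             # a short input sequence
    h, y = rnn_step(np.array([x]), h, W_xh, W_hh, W_hy, b_h, b_y)
```

Because the hidden state is re-fed at each step, the final output depends on the whole sequence, which is exactly the retention-of-the-past property described above.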
LSTM is a special type of recurrent neural network with memory cells. These memory cells are the essential component for handling long-term temporal dependencies in the data. LSTM can add or delete information from the cell state; this is done by special structures in the LSTM called gates. The three types of gate are the input gate (i_t), forget gate (f_t), and output gate (o_t), shown in Equations (3) to (8):

i_t = σ(W_xi x_t + W_hi h_(t−1) + b_i) (3)
f_t = σ(W_xf x_t + W_hf h_(t−1) + b_f) (4)
o_t = σ(W_xo x_t + W_ho h_(t−1) + b_o) (5)
C̃_t = tanh(W_xc x_t + W_hc h_(t−1) + b_c) (6)
C_t = f_t ⊙ C_(t−1) + i_t ⊙ C̃_t (7)
h_t = o_t ⊙ tanh(C_t) (8)

C̃_t is a "candidate" hidden state that is computed based on the current input and the previous hidden state. C_t is the internal memory of the unit: a combination of the previous memory, as multiplied by the forget gate, and the newly computed candidate state, as multiplied by the input gate. h_t is the output hidden state, computed by multiplying the memory by the output gate [26].
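Similarly, the LSTM gate equations can be sketched as a single NumPy step; the gate-keyed parameter dictionaries and the toy sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM step, Equations (3)-(8). W, U, b are dicts keyed by gate name."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])       # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])       # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])       # output gate
    C_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])   # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                           # new cell memory
    h_t = o_t * np.tanh(C_t)                                     # hidden output
    return h_t, C_t

# Toy configuration: 1-dimensional input, 4 hidden units.
rng = np.random.default_rng(1)
gates = ['i', 'f', 'o', 'c']
W = {g: rng.normal(size=(4, 1)) for g in gates}
U = {g: rng.normal(size=(4, 4)) for g in gates}
b = {g: np.zeros(4) for g in gates}

h, C = np.zeros(4), np.zeros(4)
for x in [0.5, -0.1]:                  # a short input sequence
    h, C = lstm_step(np.array([x]), h, C, W, U, b)
```

The forget gate multiplying `C_prev` is what lets gradients flow across many steps without vanishing, which is the property exploited for long-term dependency.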
GRU is another type of RNN with memory cells [27]. It is similar to LSTM but has a simpler cell architecture. GRU also has a gating mechanism to control the flow of information through the cell state, but it has fewer parameters and does not contain an output gate. It consists of two gates, a reset gate r and an update gate z. The reset gate regulates the flow of new input to the previous memory, and the update gate determines how much of the previous memory to keep. The following equations are used in the GRU output calculations:

z_t = σ(W_xz x_t + W_hz h_(t−1) + b_z) (9)
r_t = σ(W_xr x_t + W_hr h_(t−1) + b_r) (10)
h̃_t = tanh(W_xh x_t + W_hh (r_t ⊙ h_(t−1)) + b_h) (11)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t (12)

In previous studies [28,29], it has been noted that GRU is comparable to, or even outperforms, LSTM. To obtain high accuracy in prediction of the BDI, in this study we combined RNN, LSTM, and GRU into an ensemble method. The idea is to combine the predictions from multiple different sequential models. Each model has different strengths and weaknesses, meaning that each predicts better than the others under certain conditions. Importantly, the models must be good in different ways: they must make different prediction errors. In addition to reducing the variance of the prediction, our ensemble can also produce better predictions than any single best model. For instance, Krizhevsky et al. used model averaging across multiple well-performing CNN models to achieve outstanding results [30].

Method
In this section, we explain our data pre-processing technique, the system design of our proposed method, and the metrics used to assess the accuracy of our approach.

Data Pre-Processing and Analysis
The BDI data plotted in Figure 1a shows a sharp increase and a dramatic decrease between 2007 and 2008. Therefore, we pre-processed the data to make it more stationary. Using a decomposition technique, the BDI data is separated into three components: trend, seasonality, and noise, as depicted in Figure 1c. The trend component captures a long-run increase or decrease in the time-series values; however, the BDI data does not show any significant increasing or decreasing trend, but rather a peak in 2008 and two long tails. The seasonality component captures the short-term repeating cycle in the BDI, and the noise corresponds to random variation in the series. Due to the complexity of the bulk shipping market and the non-linear nature of freight-rate series [31], in this study, several data-transformation techniques, including the difference transform, power transform, log transform, standardization, and normalization, were employed. As indicated in Table 1, for each transformation technique, we conducted a Dickey-Fuller stationarity test to verify its effectiveness. Normalization rescales the data so that all values fall within a certain range. As shown in Equations (1), (8) and (11), the value range of the tanh function is between −1 and 1; therefore, we rescaled the BDI to this range. Unlike normalization, standardization rescales the dataset according to the distribution of values, so that the mean of the observed values is 0 and the standard deviation is 1. Simple transformation techniques such as the power transform and log transform were also applied. The 1st difference transform applied to a time series x creates a new series z whose value at time t is the difference between x(t + 1) and x(t). This method works very well in removing trends and cycles. As shown in Table 1, the '1st difference transform' results in the smallest p-value, which indicates that it generates more stationary time-series data than the original data.
Therefore, in this research, we transformed the BDI data into a more stationary form by using '1st difference transform', before feeding it into the model. The 1st-difference-transformed data is plotted in Figure 1b.
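As a minimal sketch of this pre-processing step, the 1st difference transform and its inversion (needed to map predictions back to BDI levels) look like this in NumPy, using toy values rather than the actual BDI series; the Dickey-Fuller check itself could then be run on `z` with statsmodels' `adfuller`:

```python
import numpy as np

bdi = np.array([1200.0, 1350.0, 1500.0, 1420.0, 1380.0])  # toy weekly BDI values

# 1st difference transform: z[t] = x[t+1] - x[t], which removes trends and cycles.
z = np.diff(bdi)  # -> array([ 150.,  150.,  -80.,  -40.])

# The model is trained and predicts in difference space; a cumulative sum
# anchored at the first observed level inverts the transform.
reconstructed = bdi[0] + np.concatenate(([0.0], np.cumsum(z)))
assert np.allclose(reconstructed, bdi)
```

Keeping the first observation around is essential: the difference series alone only determines the shape of the curve, not its level.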

Deep Ensemble Recurrent Network
The system design of our approach is depicted in Figure 2, and the training process is expressed in Algorithm 1. Firstly, the pre-processed data is independently trained on by Deep RNN, LSTM, and GRU. After those models converge in learning the BDI data, in the testing phase, the predictions of RNN, LSTM, and GRU, represented as P_1, P_2, and P_3, are combined using the weights w_P1, w_P2, and w_P3, respectively, to produce P_f, the predicted next value of the BDI. The values of w_P1, w_P2, and w_P3 are learned in a supervised fashion using basic forward- and back-propagation, as in a standard neural network. As explained in Section 2.2, all of these are powerful sequence models. Deep RNN is the basic recurrent model, able to learn time dependency to predict the next value; however, it cannot cope with long-term dependency. LSTM is equipped with a complex cell memory to handle long-term dependency. GRU harnesses cell memory as well, but uses a simpler cell than LSTM. Therefore, to obtain more accurate results, we combined all of those models in an ensemble called the Deep Ensemble Recurrent Network (DERN). The purpose is to minimize the error rate: in terms of memory cells, RNN is the simplest (under-fitting) model, while GRU is simpler than LSTM yet more complex (over-fitting) than RNN, a complexity ordering that can be denoted RNN < GRU < LSTM. Moreover, in machine-learning theory, no method is universally better than any other (the "no free lunch" theorem), and each method may make mistakes in different facets of operation. Stacking multiple different sequential models may lead to a performance improvement over individual models. The multi-model ensemble is a technique by which the predictions of a collection of models are given as inputs to a second-stage learning model.
The second-stage model is trained to combine the predictions from the first-stage models optimally in order to obtain a final set of predictions.
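A minimal sketch of how such second-stage weights could be learned by gradient descent on the MSE, assuming a plain linear combination for simplicity (the paper's actual ensemble layer is a dense layer with a Sigmoid activation); the prediction matrix `P` and targets `y` here are hypothetical:

```python
import numpy as np

def learn_ensemble_weights(P, y, lr=0.01, epochs=2000):
    """Fit weights w so that P @ w approximates y, by gradient descent on MSE.
    P has one column per base model (e.g., Deep RNN, LSTM, GRU predictions)."""
    n, k = P.shape
    w = np.full(k, 1.0 / k)              # start from a plain average
    for _ in range(epochs):
        err = P @ w - y                  # residuals of the combined prediction
        w -= lr * (P.T @ err) / n        # gradient of MSE with respect to w
    return w

# Hypothetical base-model predictions: three noisy versions of the target.
y = np.array([1.0, 2.0, 3.0, 4.0])
P = np.column_stack([y + 0.1, y - 0.2, y + 0.05])

w = learn_ensemble_weights(P, y)
P_final = P @ w                          # final ensemble prediction
```

Because the objective is a convex quadratic in `w`, this simplified learner converges for a small enough learning rate; the dense-layer version in the paper is trained the same way via back-propagation.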
The mechanism that we used to make short- and long-term predictions in the present study is depicted in Figure 3. Figure 3a is the one-step-ahead (short-term) prediction model. We transform the univariate BDI time-series data into x_i as a predictor variable and x_(i+1) as a response variable; afterwards, to predict x_(i+2), x_(i+1) is appended to the training data. In Figure 3b, meanwhile, the model predicts multiple BDI values at a time; thus, it predicts the sequence from x_(i+1) to x_n at once, where n is the number of prediction steps. The challenging part of this technique is creating a model that performs long-sequence prediction at once. (Algorithm 1, in outline: predict P_1, P_2, and P_3 given data D and models M_1, M_2, and M_3, respectively; then, for t = 1 to T, learn w_P1, w_P2, and w_P3, the weights of P_1, P_2, and P_3, based on D.) We conducted extensive experiments to decide the hyper-parameters of our models. After some trial and error, the optimal architectures of RNN, LSTM, and GRU are as described in Table 2. We utilized two stacked recurrent layers, i.e., two stacked RNN, LSTM, or GRU layers in the RNN, LSTM, and GRU models, respectively. Each layer consists of 500 hidden units with 20% dropout and tanh as the gate activation. We set the output of the recurrent layers to return a sequence whose length corresponds to the number of prediction steps; for instance, if the task is five-steps-ahead prediction, the sequence length is five. The sequence is wrapped by one TimeDistributed layer with the number of hidden units corresponding to the sequence length, activated by the Sigmoid function. Each model is trained independently until convergence, using Mean Square Error (MSE) as the loss function and Adam as the optimizer.
In the ensemble layer, the outputs of the models are combined by a standard neural network with one dense layer, a Sigmoid activation function, and the Adam optimizer to decide the final prediction. To assess how well our method predicted the short- and long-term BDI, the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), denoted in Equations (13)-(15), respectively, were employed:

RMSE = √((1/n) Σ_(t=1..n) (y_t − ŷ_t)²) (13)
MAE = (1/n) Σ_(t=1..n) |y_t − ŷ_t| (14)
MAPE = (100/n) Σ_(t=1..n) |(y_t − ŷ_t)/y_t| (15)

where n corresponds to the number of data points, y_t is the actual value of the BDI at time t, and ŷ_t corresponds to the predicted BDI at time t.
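The three metrics of Equations (13)-(15) can be implemented directly; the sample values below are illustrative only:

```python
import numpy as np

def rmse(y, y_hat):
    """Root Mean Squared Error, Equation (13)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean Absolute Error, Equation (14)."""
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    """Mean Absolute Percentage Error, Equation (15)."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

# Illustrative actual and predicted values:
y = np.array([1000.0, 1100.0, 1210.0])
y_hat = np.array([990.0, 1120.0, 1200.0])
```

Note that MAPE is scale-free, which is why it allows comparison against earlier BDI studies regardless of the index level at the time.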

Experiments
The original BDI data is recorded on a daily basis from November 1999 to February 2018. To simplify long-term prediction, we sampled the data on a weekly basis by using average values. The summary statistics of the BDI data are shown in Table 3. In the experiments, data shuffling was not implemented; instead, we employed a sliding-window technique to split our training and testing data. The data was divided into 70-30%, 80-20%, and 90-10% slices for training and testing. We compared the results with those of deep-learning models such as Deep RNN, LSTM, and GRU. Further, we compared our proposed method with ARIMA and the Multi-Layer Perceptron (MLP). To obtain the best parameters and architectures, a grid-search technique was employed.
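The sliding-window preparation and the chronological (unshuffled) split described above can be sketched as follows, with a toy stand-in series; the window sizes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sliding_window(series, n_in=1, n_out=1):
    """Turn a univariate series into supervised (X, y) pairs:
    each window of n_in past values predicts the next n_out values."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)        # stand-in for weekly BDI averages
X, y = sliding_window(series, n_in=3, n_out=1)

# Chronological 70-30% split: no shuffling, so the test set is strictly "future".
split = int(0.7 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```

Avoiding shuffling matters for time series: shuffling would leak future values into the training set and make the error rates look unrealistically low.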

Short-Term Prediction
Short-term prediction produces one-step-ahead BDI values; the prediction mechanism follows the model in Figure 3a. Because the data is weekly, this corresponds to predicting the BDI one week ahead. Figure 4 depicts the prediction of the BDI in the testing phase using ARIMA, MLP, LSTM, Deep RNN, GRU, and DERN. As we can see, all of them predict the BDI approximately correctly. In this case, the worst method is the ARIMA model: as shown in Table 4, the ARIMA error rate is three times higher than DERN's in terms of RMSE. Notice that when the portion of training data is changed, the error rate varies, since the BDI data is volatile; as indicated in Table 3, the data swings enormously between the 25th and 75th percentiles. Table 4 shows that Deep RNN, LSTM, and GRU have roughly similar error rates. GRU slightly outperforms LSTM, while LSTM outperforms Deep RNN. This is because GRU and LSTM have memory-cell gates to handle long-term dependency. It is known that GRU and LSTM have the same mechanism for effective tracking of long-term dependencies while mitigating the vanishing/exploding-gradient problems. The LSTM uses more complex gates than GRU; therefore, in this case, the LSTM model tended to over-fit more than GRU. Figure 5 shows that DERN is the result nearest to the original data; it is a weighted average of RNN, GRU, and LSTM, since we combined them into an ensemble using a weighting technique. Moreover, in short-term prediction of the BDI, our approach outperforms the previous Artificial Neural Network (ANN) model, for which the average MAPE was never lower than five [5]. In the ensemble layer, we obtain the weights w_P1, w_P2, and w_P3 of RNN, LSTM, and GRU, respectively. The values of w_P1, w_P2, and w_P3 are 0.337, 0.330, and 0.333, respectively. We can infer that in short-term prediction, each of the models has an approximately equal effect on the predicted value.

Long-Term Prediction
In long-term prediction, we try to predict the BDI more than one step ahead. In the present study, values three, five, and seven weeks ahead of the BDI were predicted. The experiments showed that long-term prediction resulted in a higher error rate than short-term prediction in terms of RMSE, MAE, and MAPE; we conducted the experiment up to seven-steps-ahead prediction. The experimental results of long-term prediction are shown in Tables 5-7 for three-, five-, and seven-steps-ahead prediction of the BDI. As indicated in all of those tables, DERN obtained the best error rate among the methods. In this study, ARIMA failed to predict the long-term BDI, its average error rate being more than seven times higher than that of our proposed method. A visualization of the comparison of RNN, LSTM, and GRU with DERN is provided in Figures 6-8 for three-, five-, and seven-steps-ahead prediction of the BDI, respectively. Notice that the error-rate trend increases over time. Even though the error rate grows with the increasing number of steps, the models follow the trend of the testing data. The overall error-rate averages in terms of RMSE, MAE, and MAPE, respectively, are plotted in Figure 9. Note that ARIMA was omitted due to its large error. From the data, we can infer that one-step-ahead prediction results in a much lower error rate than long-term prediction. Therefore, long-term prediction is more challenging than short-term prediction; nonetheless, ship-owners and stakeholders are commonly more interested in long-term prediction. The weights of the ensemble layer in three-steps-ahead prediction are 0.281, 0.357, and 0.362 for RNN (w_P1), LSTM (w_P2), and GRU (w_P3), respectively. In five-steps-ahead prediction they are 0.300, 0.350, and 0.350, and in seven-steps-ahead prediction 0.294, 0.362, and 0.344, for w_P1, w_P2, and w_P3, respectively.
Unlike in short-term prediction, RNN has a smaller effect on the final prediction, while LSTM and GRU have approximately equal effects on it.

Conclusions and Outlook
The Baltic Dry Index (BDI) is a parameter representative of international shipping activities. It is an essential tool with which ship-owners and stakeholders plan their maritime businesses and avoid market risk. Unlike common time-series data, the BDI is characterized by volatility, non-stationarity, and complexity; therefore, its prediction is very challenging indeed. In previous work, most researchers have used Artificial Neural Network (ANN) and statistical models. In keeping with the popularity of deep learning in this decade, in this paper, we propose a deep-learning approach whereby deep sequential models (RNN, LSTM, and GRU) are combined in an ensemble called the Deep Ensemble Recurrent Network (DERN) for accurate prediction of the short- and long-term BDI. In short-term prediction using the RMSE indicator, DERN had an error rate roughly half that of GRU, LSTM, Deep RNN, and MLP, and approximately a third that of ARIMA. However, in long-term prediction, the error rate was not as good as in short-term prediction; specifically, the results showed that the error rate grew with increasing prediction steps. Therefore, long-term prediction is more challenging than short-term prediction. Nonetheless, DERN still outperforms the conventional methods in long-term prediction, and ship-owners and stakeholders, not to mention investors, are prevalently more interested in long-term prediction. In future work, we will propose a more fine-grained approach entailing sequence-to-sequence learning for more accurate long-term prediction.

Conflicts of Interest:
The authors declare no conflict of interest.