Comparison of Different Approaches of Machine Learning Methods with Conventional Approaches on Container Throughput Forecasting

Abstract: Container transportation is an important mode of international trade logistics, and changes in it can seriously affect the development of the international market. For example, the COVID-19 pandemic has added a huge drag to global container logistics. Accurate forecasting of container throughput can therefore make a significant contribution to stakeholders who want to develop more accurate operational strategies and reduce costs. However, current research on port container throughput forecasting mainly focuses on proposing innovative forecasting methods for a single time series, and lacks comparisons of the performance of different basic models on the same and on different time series. This study uses nine methods to forecast the historical throughput of the world's top 20 container ports and compares the results within and between methods. The main findings are as follows. First, GRU is the method most likely to produce accurate results (0.54–2.27 MAPE and 7.62–112.48 RMSE, with the best accuracy on 85% of the series by MAPE and 75% by RMSE) when constructing container throughput forecasting models. Secondly, NM can be used for rapid and simple container throughput estimation when computing equipment and services are not available. Thirdly, the average accuracy of the machine learning forecasting methods is higher than that of the conventional methods, although an individual machine learning method is not necessarily more accurate than the best conventional method.


Introduction
Container shipping is an important form of international trade logistics: a great deal of goods are transported across the ocean from their origin to distant places of consumption by container ship [1]. However, the spread of COVID-19 has had a profound impact on container shipping and may even overturn its future trends [2]. Since the third quarter of 2020, there has been a global shortage in the supply of empty containers, and major shipping companies have been short of shipping space [3]. The advance booking period for Sino-European routes is about two weeks, and even the Sino-American routes are sold out. The shortage of empty containers and the lack of capacity directly led to a rapid rise in container service charges, and the Shanghai Containerised Freight Index (SCFI) and the Freightos Baltic Index (FBX) rose sharply. For example, the price of shipping a container from China to Europe rose from $2000 to $15,000, 7.5 times the previous transportation cost [4]. The rapidly rising price of container transportation has placed a heavy burden on international trade, and the prices of all kinds of goods transported by container have also risen sharply. Therefore, improving the efficiency of container shipping is an important way to improve the performance of international trade and reduce trade costs. Practitioners and scholars have studied improving the efficiency of container shipping from many angles [5][6][7][8]. This study focuses on forecasting port container throughput, because accurate forecasts of port container throughput can provide decision support for shipping companies, port owners, freight forwarders, and other container shipping participants. With the development of machine learning, a variety of sophisticated forecasting models have been proposed based on machine learning algorithms. However, many up-to-date forecasting methods have not yet been applied to container throughput forecasting.
It is necessary to compare the performance of advanced machine learning methods and conventional methods on container throughput forecasting. Therefore, the research question of this study is which of the existing forecasting methods is more accurate in forecasting container throughput.
The main contributions of this study are as follows. First, the performance of nine different time series forecasting methods, both conventional and machine learning, is compared on the same time series. Secondly, the comparison identifies GRU as a method that produces accurate results on short time series, providing practical guidance for future forecasting research. Thirdly, it is found that machine learning forecasting results on short time series are not necessarily better than those of conventional methods, and that more complex models tend to produce less satisfactory forecasts.

Literature Review
From the perspective of learning mechanisms of forecasting models, we can divide them into two categories: conventional forecasting models and machine learning forecasting models. Conventional forecasting models are those that use simple rules or methods to forecast future values, such as the naïve method (NM), moving average (MA), autoregressive (AR) and autoregressive integrated moving average (ARIMA), etc. Machine learning forecasting models are those that employ more complex computational methods and model structures to extract underlying patterns from the data, such as multilayer perceptron (MLP), recurrent neural network (RNN), convolutional neural network (CNN) and Transformer, etc. The summary of the literature review is presented in Table 1.
Among the conventional forecasting models, the naïve method is the simplest yet often surprisingly effective time series forecasting method [9]. It takes the actual value at time t − 1 as the forecast for time t. In practice, many enterprises use the naïve method as their basic forecasting method to guide operations planning. The naïve method is also used as a benchmark for evaluating the performance of other forecasting methods [10]: a newly designed forecasting model is considered valid only if its accuracy exceeds that of the naïve method. In this role, it is analogous to random guessing as a baseline in classification problems. The moving average is another method commonly used to forecast future values such as demand and capacity [11]. It forecasts a future value as the average of a group of recent actual values. However, this method is only appropriate when demand is neither growing nor declining rapidly and there is no seasonal factor. Previous studies have investigated the optimal MA length for forecasting future demand, and their findings suggest that the optimal MA length is related to the frequency of structural changes [12]. The autoregressive model, developed from linear regression in regression analysis and adapted to time series [13], uses the historical values of the same variable (y_{t−1} to y_{t−n}) to forecast the current value y_t. Because it uses only the variable's own historical values, and no other variables, it is called autoregressive. Many studies have analysed and improved AR [13][14][15][16]. Furthermore, Box and Jenkins integrated the AR and MA methods, added an integration (differencing) step, and put forward the ARIMA time series forecasting model [17]. On this basis, ARIMAX and SARIMA were designed to handle multivariate input data and seasonal input data, respectively.
Many studies use ARIMA and its derived models to forecast future values of a target and obtain acceptable forecasting accuracy [18][19][20]. These traditional methods are used by many enterprises because of their simple deployment and fast computation. However, they have difficulty capturing complex relationships among a large number of influencing factors, so scholars have put forward more complex and effective forecasting models based on machine learning (ML) [21].
MLP is a kind of neural network machine learning model that has attracted a great deal of attention [22]. It is a fully connected feedforward artificial neural network and has been employed as a benchmark to test the forecasting performance of other models [23][24][25]. MLP has also been improved by integrating it with other forecasting models [26][27][28][29]. The concept of deep learning originates from the development of the artificial neural network [30]. An MLP with multiple hidden layers can be considered a deep learning structure [31]. By combining low-level features, deep learning can form more abstract high-level attributes or features to discover distributed feature representations of data [32]. There are many architectures for deep learning, among which RNN is a common one, and many complex and well-performing deep learning architectures are based on RNN [33]. RNN handles sequential data well and is often used for language processing problems. The gated recurrent unit (GRU) and long short-term memory (LSTM) are two representative RNN architectures. For instance, Noman et al. proposed a GRU-based model to forecast the estimated time of arrival for vessels; their experiments show that the GRU-based model produced the best forecasting accuracy compared to other methods [34]. Moreover, Chen and Huang employed an Adam-optimised GRU (Adam-GRU) to forecast port throughput and found that it produces relatively accurate forecasting results [35]. Shankar et al. built a container throughput forecasting model using LSTM and showed that LSTM can also generate accurate forecasts [36]. CNN is another commonly used deep learning architecture. It was originally designed for computer vision problems such as image recognition, and some scholars later applied CNN to the analysis and forecasting of sequence data. For instance, Chen et al.
proposed a temporal CNN to estimate the probability density of time series [37]. Many studies have employed CNN to build time series forecasting models [38][39][40][41]. More recently, Transformer, another deep learning architecture, was first proposed by Google Brain in 2017 to solve sequential data problems such as natural language processing (NLP) [42]. It feeds all input data into the model at once and uses positional encodings, attention, and self-attention mechanisms to capture patterns in the data. Based on Transformer, scholars have also put forward powerful NLP models such as GPT-3 [43], BERT [44], and T5 [45]. Later, some scholars applied Transformer to time series forecasting, since time series data and text data are both sequential [46]. Experimental results show that Transformer can produce more accurate time series forecasts than previous work, and a number of recent studies using Transformer for forecasting all suggest that it performs well in time series forecasting [46][47][48][49].
However, these studies each assessed only some of these methods, and no research has investigated the performance of all of them on the same time series simultaneously. Thus, which method performs better on the same container throughput time series remains unclear. In this context, the aim of this study is to compare several existing forecasting methods on the container throughput of the same port, so that insights for selecting an appropriate method can be suggested.

Table 1. Summary of the literature review.

Ref. | Methods | Data | Main findings
— | PSO-MLP | — | The proposed PSO-MLP model addresses the drawbacks of the MLP-only model and performs better than conventional artificial neural networks (ANNs) and statistical models.
[25] | MLP, linear regression (LR) | COVID-19 positive cases from March to mid-August 2020 in West Java | MLP is optimal with 13 hidden layers and a learning rate and momentum of 0.1; MLP had a smaller error than LR.
[26] | Random forest, MLP | Six years of electrical load data from a university campus | The hybrid forecast model performs better than other popular single forecast models.
[27] | MLP, whale optimisation algorithm | Real gold price | The proposed WOA-NN model improves forecasting accuracy over the classic NN, PSO-NN, GA-NN, GWO-NN, and ARIMA models.
[28] | Dynamic regional combined short-term rainfall forecasting (DRCF) approach, MLP | Actual height, temperature, temperature dew-point difference, wind direction, and wind speed at 500 hPa | DRCF outperforms existing approaches in both threat score (TS) and root mean square error (RMSE).
[29] | Local MLP | Simulated data | A greater degree of decomposition leads to a greater reduction in forecast errors.
[34] | GRU | Vessels travelling on an inland waterway | GRU provides the best prediction accuracy.
[37] | DeepTCN | JD-demand, JD-shipment, electricity, traffic, and parts | The framework compares favourably to the state of the art in both point and probabilistic forecasting.
[38] | CNN | Bid and ask data | CNNs are better suited for this kind of task.
[39] | LSTM, CNN | Electric load dataset in the Italy-North area | The proposed model achieves better and more stable performance in short-term load forecasting (STLF).
[40] | CNN | Australian solar PV power data | Convolutional and multilayer perceptron neural networks performed similarly in accuracy and training time and outperformed the other models.
[41] | Non-pooling CNN | Simulated data, daily visits to a website | Convolutional layers tend to improve performance, while pooling layers tend to introduce too many negative effects.
[46] | Transformer | ILI data from the CDC | The Transformer-based approach can model observed time series data as well as the phase space of state variables through time-delay embeddings.
[47] | Transformer (modified) | — | Enhances the locality of Transformer and breaks its memory bottleneck.

Materials and Methods
This study compares the performance of nine different time series forecasting methods on the same time series: the conventional methods, which are the naïve method (NM), moving average (MA), autoregressive (AR), and autoregressive integrated moving average (ARIMA); and the machine learning methods, which are the multilayer perceptron (MLP), two recurrent neural network (RNN) variants (GRU and LSTM), the convolutional neural network (CNN), and Transformer. This section explains the technical details of these nine methods, such as calculation methods, flow charts, and parameter definitions.

Conventional Approaches
Conventional forecasting approaches mainly refer to methods with a simple calculation process, few adjustable parameters, fast calculation speed, and poor ability to learn complex nonlinear relations, such as NM, MA, AR, and ARIMA. This subsection explains the technical details of these conventional approaches.

Naïve Method
The expression of NM is shown in Equation (1):

y_t = y_{t−1}    (1)

where y_t is the forecast of the target variable at time t, and y_{t−1} is the real value of the target variable at time t − 1.
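As an illustration, Equation (1) amounts to a one-line function. The sketch below is not the authors' code, and the throughput numbers are made up for demonstration.

```python
# Naive method (Equation (1)): the forecast for time t is the actual value
# at time t-1. The throughput values below are purely illustrative.
def naive_forecast(series):
    """Return one-step-ahead naive forecasts for positions 1..len(series)-1."""
    return [series[t - 1] for t in range(1, len(series))]

throughput = [100, 104, 110, 108]
print(naive_forecast(throughput))  # [100, 104, 110]
```

Each forecast simply repeats the previous observation, which is why NM is so cheap to compute and serves as a natural benchmark.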

Moving Average
The expression of MA is shown in Equation (2):

y_t = (1/n) × (y_{t−1} + y_{t−2} + … + y_{t−n})    (2)

where y_t is the forecast at time t, y_{t−i} is the real observation at time t − i, and n is the size of the moving window.
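A minimal sketch of Equation (2), again with illustrative numbers rather than the paper's data:

```python
# Moving average (Equation (2)): forecast the next value as the mean of the
# last n observations. The history values below are illustrative only.
def moving_average_forecast(series, n):
    return sum(series[-n:]) / n

history = [100, 104, 110, 108]
print(moving_average_forecast(history, n=2))  # (110 + 108) / 2 = 109.0
```

The window size n is the only tunable parameter, which is what later grid searches over MA settings vary.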

Autoregressive
The expression of the autoregressive method is shown in Equation (3) [50]:

Y_t = φ_1 Y_{t−1} + φ_2 Y_{t−2} + … + φ_p Y_{t−p} + a_t    (3)

where φ_1, …, φ_p are the autoregressive coefficients, p is the autoregressive order, Y_t is the real time series value at time t, and a_t is Gaussian white noise with zero mean and variance σ².
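Equation (3) can be fitted by ordinary least squares. The following is a hedged sketch, not the paper's implementation; the function names and the toy series (an exact AR(1) process) are mine.

```python
import numpy as np

def fit_ar(series, p):
    """Estimate the AR(p) coefficients of Equation (3) by ordinary least
    squares (no intercept, for simplicity)."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Column k holds the lag-(k+1) values aligned with the targets y[p:].
    X = np.column_stack([y[p - 1 - k: n - 1 - k] for k in range(p)])
    phi, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return phi

def ar_forecast(series, phi):
    """One-step-ahead forecast: phi_1*Y_{t-1} + ... + phi_p*Y_{t-p}."""
    y = np.asarray(series, dtype=float)
    return float(np.dot(phi, y[::-1][:len(phi)]))

series = [16.0, 8.0, 4.0, 2.0, 1.0]  # follows Y_t = 0.5 * Y_{t-1} exactly
phi = fit_ar(series, p=1)
print(phi)                       # approximately [0.5]
print(ar_forecast(series, phi))  # approximately 0.5
```

In practice a library routine (e.g. a statsmodels AutoReg fit) would be used, but the least-squares view shows why AR is essentially linear regression on the series' own lags.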

AutoRegressive Integrated Moving Average
ARIMA consists of three parts: AR, integration (I), and MA, with corresponding parameters p, d, and q, respectively; the general model is written ARIMA(p, d, q). The expression of ARIMA is shown in Equation (4) [21]:

φ_p(B)(1 − B)^d Y_t = θ_q(B) a_t    (4)

where B is the back-shift operator, φ_p(B) and θ_q(B) are the AR and MA polynomials in B, and a_t is Gaussian white noise with zero mean and variance σ². The expression of each parameter is shown in Table 2 [21].
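To make the integration step concrete, here is a minimal ARIMA(1,1,0) sketch under my own simplifying assumptions: difference once (d = 1), fit AR(1) on the differences by least squares, then invert the differencing. It is illustrative only; a real application would use a library such as statsmodels.

```python
import numpy as np

def arima_110_forecast(series):
    """Minimal ARIMA(1,1,0) sketch: difference once (d = 1), fit AR(1) on
    the differences, then undo the differencing for the forecast."""
    y = np.asarray(series, dtype=float)
    d = np.diff(y)  # the 'integration' part: first differences
    phi = np.linalg.lstsq(d[:-1, None], d[1:], rcond=None)[0][0]
    return float(y[-1] + phi * d[-1])  # undo differencing for the forecast

series = [0.0, 8.0, 12.0, 14.0, 15.0]  # differences halve at each step
print(arima_110_forecast(series))      # 15 + 0.5 * 1 = 15.5
```

Differencing removes the trend so that the AR part only has to model the (stationary) changes, which is exactly the role of the d parameter in ARIMA(p, d, q).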

Machine Learning
Machine learning forecasting methods mainly refer to methods with a complex calculation process, many adjustable parameters, slow calculation speed, and strong learning ability for complex nonlinear relations. These methods, such as MLP, RNN, CNN, and Transformer, can obtain better fitting results by adjusting a large number of parameters.

MLP
MLP is an interconnected network composed of many simple neurons. When the input signal to a neuron exceeds its threshold, the neuron enters an excited state and sends information to downstream neurons, which repeat the same steps. The basic structure of MLP is shown in Figure 1. The input data is connected to the neurons in the input layer (L_n), and there is a fully connected architecture between the neurons in the input layer (L_n) and those in the hidden layer (H_n). Each connection to a downstream neuron is weighted. Similarly, the neurons in the hidden layer (H_n) and those in the output layer (O_n) are fully connected with weighted links [51].
First, the values in each layer are vectorised. The output of the hidden layer is H = σ(w_H · x + b_H), where σ is the activation function, w_H is the matrix of weights on the links between the input layer and the hidden layer, and b_H is the vector of threshold values of the neurons in the hidden layer. The output of the output layer is O = σ(w_O · H + b_O), where w_O is the matrix of weights on the links between the hidden layer and the output layer, and b_O is the vector of threshold values of the neurons in the output layer.
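The two layer equations above can be sketched as a forward pass in a few lines. This is an illustration, not the study's model; the dimensions and the all-zero weights are toy assumptions (real weights come from training).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, w_H, b_H, w_O, b_O):
    """Forward pass of the MLP in Figure 1: hidden = sigma(w_H x + b_H),
    then output = sigma(w_O hidden + b_O)."""
    hidden = sigmoid(w_H @ x + b_H)
    return sigmoid(w_O @ hidden + b_O)

# Toy setup: 2 lagged inputs, 3 hidden neurons, 1 output.
x = np.array([0.2, 0.5])
w_H, b_H = np.zeros((3, 2)), np.zeros(3)
w_O, b_O = np.zeros((1, 3)), np.zeros(1)
y_hat = mlp_forward(x, w_H, b_H, w_O, b_O)
print(y_hat)  # with all-zero weights every sigmoid outputs 0.5
```

Training consists of adjusting w_H, b_H, w_O, and b_O so that y_hat approaches the observed throughput.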

GRU
As mentioned earlier, a GRU is an RNN structure, and the recurrent model of a common RNN is shown in Figure 2. An RNN is commonly composed of one or more units (the green rectangle A in Figure 2), and the learning model is constructed by iteratively updating the parameters in the units. The basic structure of a GRU unit is shown in Figure 3. The calculation expressions of its parameters are shown in Equations (10)-(13) [52]:

z_t = σ_g(W_z x_t + U_z h_{t−1} + b_z)    (10)
r_t = σ_g(W_r x_t + U_r h_{t−1} + b_r)    (11)
ĥ_t = φ_h(W_h x_t + U_h(r_t ⊙ h_{t−1}) + b_h)    (12)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ ĥ_t    (13)

where x_t is the input vector, h_t is the output vector, ĥ_t is the candidate activation vector, z_t is the update gate vector, r_t is the reset gate vector, ⊙ denotes element-wise multiplication, W, U, and b are parameter matrices and vectors, and σ_g and φ_h are the activation functions.
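Equations (10)-(13) can be sketched as a single GRU step. This is a didactic sketch with untrained (all-zero) 1-dimensional parameters, not the study's trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W, U, b):
    """One GRU step following Equations (10)-(13); W, U, b hold the update
    (z), reset (r) and candidate (h) parameters."""
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])            # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])            # reset gate
    h_hat = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])  # candidate
    return (1.0 - z) * h_prev + z * h_hat                           # new state

keys = ("z", "r", "h")
W = {k: np.zeros((1, 1)) for k in keys}
U = {k: np.zeros((1, 1)) for k in keys}
b = {k: np.zeros(1) for k in keys}
h = gru_cell(np.array([0.3]), np.array([1.0]), W, U, b)
print(h)  # z = 0.5 and h_hat = 0, so h = 0.5 * h_prev = [0.5]
```

The update gate z blends the previous state with the candidate, which is how the GRU decides how much history to keep at each step.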

LSTM
LSTM is another type of RNN with the same recurrent model as in Figure 2. Figure 4 presents the common structure of an LSTM unit. There are three types of gates in the unit: the input gate, the forget gate, and the output gate. The calculation expressions of the parameters of LSTM are shown in Equations (14)-(19) [53]:

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)    (14)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)    (15)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)    (16)
Ĉ_t = φ_c(W_c x_t + U_c h_{t−1} + b_c)    (17)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ Ĉ_t    (18)
h_t = o_t ⊙ φ_h(C_t)    (19)

where x_t is the input vector, f_t is the forget gate's activation vector, i_t is the input (update) gate's activation vector, o_t is the output gate's activation vector, h_t is the output vector, Ĉ_t is the cell input activation vector, C_t is the cell state vector, ⊙ denotes element-wise multiplication, W, U, and b are parameter matrices and vectors, and σ_g, φ_c, and φ_h are activation functions.
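Equations (14)-(19) can likewise be sketched as one LSTM step; again this uses untrained toy parameters, not the study's model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Equations (14)-(19); W, U, b hold the forget
    (f), input (i), output (o) and cell-candidate (c) parameters."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate
    c = f * c_prev + i * c_hat                                # cell state (18)
    h = o * np.tanh(c)                                        # output (19)
    return h, c

keys = ("f", "i", "o", "c")
W = {k: np.zeros((1, 1)) for k in keys}
U = {k: np.zeros((1, 1)) for k in keys}
b = {k: np.zeros(1) for k in keys}
h, c = lstm_cell(np.array([0.3]), np.array([0.0]), np.array([2.0]), W, U, b)
print(c)  # all gates 0.5, candidate 0: c = 0.5 * c_prev = [1.0]
```

Compared with the GRU, the LSTM maintains a separate cell state C_t in addition to the output h_t, which is one reason it has more parameters to fit.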

CNN
A CNN is constructed from an input layer, convolution layers, pooling layers, fully connected layers, and an output layer. The input data is first convolved with a convolution kernel to produce a convolution layer. The pooling layer then applies a pooling method, such as max pooling or average pooling, to reduce the size of the parameter matrix and thereby the number of parameters in the fully connected layer; adding a pooling layer therefore speeds up computation and helps prevent overfitting. After pooling, the data is fed into the fully connected layer, which can be treated as a traditional multilayer perceptron whose input is the features extracted by the convolution and pooling layers. The final output layer can use logistic regression, softmax regression, or even a support vector machine to generate the output. The network is trained by gradient descent, which minimises the loss function by adjusting the weights backwards through the network layer by layer, and accuracy improves through repeated iterations of training.
The CNN was originally designed for computer vision problems, where the default input is an RGB image. This type of CNN is called a 3D-CNN, because the RGB image can be split into three sub-images, one per colour channel. If the input data is a time series, the CNN is called a 1D-CNN. The basic structure of the 1D-CNN is shown in Figure 5 [54].
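The convolution and pooling steps of a 1D-CNN can be sketched in plain numpy. This is an illustration of the operations only (the kernel values and series are made up), not a trainable network.

```python
import numpy as np

def conv1d(series, kernel):
    """'Valid' 1-D convolution (cross-correlation, as in deep learning
    libraries) of a time series with a filter kernel."""
    k = len(kernel)
    return np.array([float(np.dot(series[i:i + k], kernel))
                     for i in range(len(series) - k + 1)])

def max_pool1d(x, size):
    """Non-overlapping max pooling, shrinking the feature map."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

series = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 6.0])
features = conv1d(series, np.array([0.5, 0.5]))  # 2-tap smoothing filter
print(features)                 # [1.5, 3.0, 3.5, 4.0, 5.5]
print(max_pool1d(features, 2))  # [3.0, 4.0]
```

In a real 1D-CNN the kernel weights are learned, and several filters run in parallel; the number of filters is exactly the hyperparameter searched over in the experiments below.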

Transformer
Transformer is the first transduction model that relies entirely on self-attention (sometimes called intra-attention) to compute representations of its input and output without using recurrence or convolution. When a dataset is fed into the Transformer, the data first passes through the encoder module to be encoded, and the encoded data is then sent to the decoder module for decoding; after decoding, the processed result is obtained. The basic structure of Transformer is shown in Figure 6 [46]. The encoder input is fed into the input layer of the encoder, and positional encoding is used to inject information about the relative or absolute position of the tokens in the sequence [42]. Encoder layers 1 and 2 then encode the data; the number of encoder layers can be defined by the user. After the encoding process, the encoder output is fed into decoder layer 1 of the decoder. At the same time, the decoder input is fed into the input layer of the decoder, whose output is also fed into decoder layer 1. After processing by decoder layer 2 and a linear mapping, the final output is obtained. Similarly, the number of decoder layers can also be defined by the user.
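The positional encoding mentioned above follows the sinusoidal formulation of the original Transformer paper [42]. A minimal sketch (assuming an even model dimension) is:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original Transformer paper
    [42] (d_model assumed even): PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
print(pe.shape)   # (4, 8)
print(pe[0, :2])  # position 0: sin(0) = 0.0, cos(0) = 1.0
```

Because the whole sequence enters the model at once, these encodings are what tells the attention layers where each annual observation sits in time.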

Process of Comparison
The comparison process is shown in Figure 7. The first step is to feed the top 20 container ports' throughput into the methods to be compared. The forecasting results are then analysed from intra-method and inter-method perspectives, respectively. As an example, the pseudocode of the learning and forecasting processes of MLP is presented in Algorithm 1. In line 1, a range of hidden layer sizes for the MLP is predefined. Then a variable named output, with an empty value, is predefined to hold the results generated by MLP models with different hidden layer sizes. From line 3 to line 14, two for loops produce the forecasting results. More details about the search range of each method can be found in Table 3. The source code of each forecasting method and the comparative plots can be found at: https://github.com/tdjuly?tab=repositories (accessed on 20 September 2022).
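Algorithm 1 itself is not reproduced here; the sketch below only mirrors its two-loop search pattern (an outer loop over candidate settings, an inner walk-forward forecasting loop, results stored in `output`). For brevity the tuned setting is the MA window size rather than the MLP hidden layer size, and the data is a toy series.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a)
                       for a, f in zip(actual, forecast)) / len(actual)

def grid_search_ma(train, test, window_sizes):
    """Try each candidate window size, store its test MAPE in `output`,
    and return the best setting, mirroring Algorithm 1's two loops."""
    output = {}
    for n in window_sizes:             # outer loop over candidate settings
        history, forecasts = list(train), []
        for actual in test:            # inner loop: walk-forward forecasting
            forecasts.append(sum(history[-n:]) / n)
            history.append(actual)
        output[n] = mape(test, forecasts)
    return min(output, key=output.get), output

best, scores = grid_search_ma(train=[10, 12], test=[14, 16], window_sizes=[1, 2])
print(best)  # the shorter window tracks this rising toy series better
```

The same skeleton applies to every method in the comparison: only the inner model fit and the meaning of the searched setting change.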

Data Description
In this study, annual container throughput from 2004 to 2020 was obtained from the official websites of the world's top 20 container ports. For each port, there are 17 observations. The statistical description of the data is shown in Table 4 and the time plots of the container throughput of the world's top 20 container ports are shown in Figure 8. It can be seen from the figure that the annual container throughput of most ports shows a trend of gradual increase, such as Antwerp, Guangzhou, Qingdao, Ningbo, Busan, etc. However, some, such as Hong Kong, showed a downward trend. Some ports, such as Dalian and Dubai, showed a trend of increasing first and then decreasing.
Before the experiment, the obtained data should be divided into the training set and testing set. The training set is used to tune the parameters of the model, so that the forecasting results of the model will be closer to the real value. The testing set is used to test the accuracy of the trained model on the new data. According to Al-Musaylh et al. (2018), 80/20 is a common ratio of training and testing sets [55]. Therefore, the training set includes 13 observations (76%) from 2004 to 2016. The testing set includes four observations (24%) from 2017 to 2020.
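The 13/4 split described above is a simple index slice; the sketch below uses made-up throughput values, not the collected port data.

```python
# 17 annual observations (2004-2020) per port; the values are toy numbers.
years = list(range(2004, 2021))
throughput = [1000.0 + 50.0 * i for i in range(len(years))]

split = 13  # 2004-2016 (76%) for training, 2017-2020 (24%) for testing
train, test = throughput[:split], throughput[split:]
train_years, test_years = years[:split], years[split:]
print(len(train), len(test))  # 13 4
```

Keeping the split chronological (rather than random) matters for time series: the model must be tested only on years that come after its training data.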

Results and Discussion
In this study, following the flow of Algorithm 1, we completed the comparison of the nine forecasting methods on the collected data. This section analyses the forecasting results from both intra-method and inter-method perspectives. In the intra-method comparison, we compare the forecasting performance of the same method on different time series and analyse the reasons for the observations. In the inter-method comparison, we compare the forecasting performance of different methods on the same time series and analyse the reasons for the observations. Finally, based on the phenomena and reasons obtained, we draw conclusions to guide subsequent forecasting research.

Figure 9 presents the MAPEs and RMSEs of MLPs with different hidden layer sizes. It can be seen that for 80% (16/20) of the container throughput time series, a lower MAPE and RMSE can be found by increasing the hidden layer size of the MLP: Antwerp, Busan, Dalian, Dubai, Guangzhou, Hong Kong, Kaohsiung, Kelang, Long Beach, Los Angeles, Ningbo, Rotterdam, Shanghai, Shenzhen, Tanjung, and Xiamen. In addition, many time series reach their minimum error when the hidden layer size is small, such as Antwerp, Kelang, Long Beach, and Tanjung. This observation indicates that increasing the hidden layer size is useful for finding MLP models with higher forecasting accuracy, but the hidden layer size corresponding to the optimal result need not be large.

Figure 10 presents the MAPEs and RMSEs of GRUs with different hidden layer sizes. One obvious experimental result is that the MAPE and RMSE values of all ports decrease as the hidden layer size increases, which means that GRU forecasting accuracy for all ports improves with larger hidden layers. However, the rate of improvement drops off rapidly at a certain point (hidden layer size ≈ 100).
This observation suggests that when using GRU to build the forecasting model, 100 can be selected as the initial hidden layer size, considering forecasting accuracy, computational complexity, and other factors, and the hidden layer size can then be adjusted to find the most appropriate value.

The MAPEs and RMSEs of LSTMs with different hidden layer sizes are presented in Figure 11. It can be seen that the MAPE and RMSE values increase as the hidden layer size increases, which means that the LSTM forecasting accuracy of all ports worsens with larger hidden layers. A possible reason is that the LSTM model is best suited to analysing time series with a long time span and a large number of observations. With the short time span and limited number of observations of the container throughput data, the LSTM model cannot accurately extract the patterns in container throughput and easily produces poorly fitted results, which leads to lower accuracy as the LSTM hidden layer size increases.
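For reference, the MAPE and RMSE values reported in Figures 9-11 and Tables 5-6 follow their standard definitions; a minimal sketch with made-up numbers:

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a)
                       for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error; more sensitive than MAPE to extreme errors."""
    return math.sqrt(sum((a - f) ** 2
                         for a, f in zip(actual, forecast)) / len(actual))

actual, forecast = [100.0, 200.0], [110.0, 190.0]
print(mape(actual, forecast))  # (10/100 + 10/200) / 2 * 100 = 7.5
print(rmse(actual, forecast))  # sqrt((100 + 100) / 2) = 10.0
```

The squaring inside RMSE is what makes it react more strongly to a single large error than MAPE does, which explains the unsynchronised MAPE/RMSE rankings discussed below.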

Intra-Method Comparison
This observation suggests that when LSTM is used for forecasting model construction, it is not necessary to select a large hidden layer size.

It can be seen that around 90% of the CNN forecasting models can find the minimum MAPE and RMSE values as the number of filters increases: Busan, Dalian, Dubai, Guangzhou, Hamburg, Hong Kong, Kaohsiung, Kelang, Los Angeles, Ningbo, Qingdao, Rotterdam, Shanghai, Shenzhen, Singapore, Tanjung, Tianjin, and Xiamen. There are also some unsynchronised changes in MAPE and RMSE, for Guangzhou, Qingdao, Shenzhen, and Singapore. The unsynchronised change may be due to changes in extreme values in the forecasting results, because RMSE is more sensitive to extreme values. Overall, this observation suggests that when CNN is used to build the container throughput forecasting model, it is necessary to increase the number of filters in the search for the model that produces the most accurate results.

Table 5 presents the test-set MAPEs obtained by the nine methods, based on the optimal training-set results, on the container throughput time series of the 20 ports. It can be seen that 17 of the 20 series obtained their minimum MAPE with GRU. For the remaining three series, the minimum MAPE of two is produced by ARIMA and that of one by Transformer. A possible reason is that the container throughput time series are too short to be exploited by methods other than GRU. Similar observations suggest that GRU performs better on certain smaller, less frequent datasets [56,57]. Despite having a similar structure to GRU, LSTM performs worse, possibly because it cannot extract enough patterns from the short time series used in this study.
For the time series of Guangzhou Port, the method to generate the minimum MAPE is ARIMA, but its result is very close to that of GRU, which are 2.2039 and 2.2658, respectively. This indicates that GRU can also produce relatively accurate results, but ARIMA is slightly more accurate.

Inter-Method Comparison
In terms of the average MAPE across the 20 ports, the best-performing method is GRU, followed by CNN and NM. Surprisingly, the simplest method, NM, ranked third in forecasting accuracy. Considering its simplicity, convenience, and ease of operation, NM can be used for rapid and simple container throughput estimation when computing equipment and services are not available. This finding is consistent with previous studies, which also found that although NM results are not as good as those of other methods, their accuracy is very close [58]. Another finding is that the average performance of the machine learning methods is better than that of the traditional methods, at 7.89 and 8.39, respectively.

Table 6 presents the test-set RMSEs obtained by the nine methods, based on the optimal training-set results, on the container throughput time series of the 20 ports. The RMSE results are similar to the MAPE results, and the best forecasting method is still GRU. However, there are slight differences: the number of series on which ARIMA performs best increased from 2 to 3, and for Transformer from 1 to 2. A possible reason is that for some time series, GRU produces larger errors where the actual values are higher, which leads to larger differences. Moreover, because RMSE is highly sensitive to extreme errors, the RMSE results for some time series are not ideal even when GRU performs well in terms of MAPE.

Conclusions
This research is a comparative study of nine forecasting methods on container throughput time series, four of which are traditional regression-based methods and five of which are machine learning-based methods. The main finding is that GRU is the method most likely to produce accurate results when constructing container throughput forecasting models. Another finding is that NM can be used for rapid and simple container throughput estimation when computing equipment and services are not available. The study also confirms that machine learning methods are, on average, the better choice over the traditional methods. An important conclusion drawn from the experimental results is that while machine learning methods are useful for training forecasting models, the characteristics of the data affect their performance; machine learning methods are therefore not necessarily better than traditional forecasting methods, and one should be cautious about relying on them to build forecasting models. This study compares the performance of different methods on multiple time series characterised by a short observation period and a small number of observations, so its conclusions are applicable to time series with similar characteristics.
Although this study explores the performance of nine different methods in forecasting the throughput of the world's top 20 container ports, it still has limitations. As ports are hubs of world trade, changes in port throughput are determined not only by the port city but also by the operating situation of ports around the world and the development of the world trade market. This study only uses historical port throughput data as the data source. Therefore, a future research direction is to add influencing factors, such as the development of port facilities, economic data of port cities, and transportation links between ports, into the forecasting model and analyse their impact on port container throughput.

Data Availability Statement: Publicly available datasets were analysed in this study. The data can be found here: https://github.com/tdjuly/Port-container-throughput (accessed on 20 September 2022).