Forecasting the Demand for Container Throughput Using a Mixed-Precision Neural Architecture Based on CNN–LSTM

Abstract: Forecasting the demand for container throughput is a critical indicator of a port's development level in global business management and industrial development. Time-series analysis approaches are crucial techniques for forecasting the demand for container throughput. However, accurate demand forecasting for container throughput remains a challenge for time-series analysis approaches. In this study, we proposed a mixed-precision neural architecture to forecast the demand for container throughput. This study is the first work to use a mixed-precision neural network to forecast container throughput: the mixed-precision architecture used a convolutional neural network to learn the strength of the features and used long short-term memory to identify the crucial internal representation of the time series depending on the strength of the features. Experiments on the container throughput demand of five ports in Taiwan were conducted to compare our deep learning architecture with other forecasting approaches. The results indicated that our mixed-precision neural architecture exhibited higher forecasting performance than classic machine learning approaches, including adaptive boosting, random forest regression, and support vector regression. The proposed architecture can effectively predict the demand for port container throughput and effectively reduce the costs of port planning and development in the future.


Introduction
Globalization has facilitated the rapid growth of international trade. The degree of containerization has emerged as a competitive advantage for countries in this globalized trading environment [1]. Island countries rely heavily on imports and are considerably affected by changes in the global economy. In this rapidly changing competitive environment, port operations, construction, and upgrading of port facilities are critical, and governments and industries accord considerable attention to these aspects. Container throughput is essential information for port infrastructure investment and construction, which is a substantial, irreversible investment. Constructing a new port or upgrading an existing one requires considerable time, and the availability of port facilities is considerably restricted during construction [2]. If the trade volume and container throughput of a seaport are unbalanced, then the performance and international competitiveness of the seaport are severely affected [3]. Therefore, failure to forecast container throughput may result in substantial financial losses. Although researchers have developed various approaches for forecasting container throughput based on linear systems, most methods are extensions of classic time-series models, including autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA) [4], exponential smoothing [5], gray forecasting [6], and regression analysis [7]. However, a container throughput time series consists of linear and nonlinear trends; therefore, achieving effective forecasting is difficult. In contrast to previous studies, in our work, a forecasting method was developed based on a mixed-precision neural architecture to forecast future container throughput.
This study focused on investigating the capability of the convolutional neural network (CNN) to learn the strength of the features of the container throughput and the effectiveness of long short-term memory (LSTM) in identifying the crucial internal representation of the time series depending on the strength of the features. In particular, the contribution of this study consists of three parts. First, the state-of-the-art method in a mixed-precision neural architecture was investigated for throughput forecasting. Second, the proposed hybrid CNN-LSTM model exhibited the considerable advantages of both the CNN and LSTM. Finally, the robustness of this mixed-precision neural architecture was tested using five container throughput series with different trends. This study is the first to use a mixed-precision neural network to forecast container throughput. This work provides a solid foundation for a novel neural network architecture in order to guide future studies to develop more effective approaches based on our foundation.
The remainder of this paper is organized as follows. Section 2 presents a brief review of previous works related to container throughput forecasting. Section 3 introduces the CNN and LSTM and presents the CNN-LSTM forecasting approach in container throughput forecasting. Section 4 presents the empirical results and a discussion. Finally, Section 5 summarizes the conclusions and future research directions.

Literature Review
Container throughput forecasting first emerged in the 1980s. Since then, considerable development has been achieved in the field [8,9]. Most forecasting methods, such as autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA) [4], exponential smoothing [5], gray forecasting [6], and regression analysis [7], assume a linear trend. However, a container throughput time series consists of linear and nonlinear trends. Therefore, achieving effective forecasting is difficult. An increasing number of studies have developed nonlinear forecasting approaches to address nonlinear systems.
Niu presented a novel hybrid decomposition ensemble method to decompose container transportation volume series into low-frequency and high-frequency components. This method involves the use of the moving-average model and support vector regression (SVR) to process the two components and generate the final throughput forecast [10]. Chen used genetic programming (GP), decomposition methods, and SARIMA methods to predict Taiwan's container throughput and concluded that the GP model outperformed the other two methods [11]. Syafi'i et al. introduced a vector error correction method based on a vector autoregression to predict the container transportation volume [12]. Their method was promising and had considerable development potential. Guo et al. proposed an improved gray Verhulst model, which overcomes the problem of increasing forecast error when the container transportation volume exhibits an S-shaped trend. Their findings indicated that the proposed gray Verhulst model not only achieved high prediction accuracy but also retained the strengths and characteristics of the gray system model [13].
Gosasang et al. used multilayer perceptron (MLP) and linear regression (LR) forecasting models with multivariate data collected from public institutions to forecast cargo throughput; the models were compared in terms of their root mean squared errors (RMSEs) and mean absolute errors (MAEs). Their results suggested that MLP forecasting outperformed LR forecasting [14]. Geng et al. proposed a hybrid model with multivariate adaptive regression splines for selecting SVR input variables. Because SVR fails to provide reasonable performance without appropriate parameters, their study used the chaotic simulated annealing particle swarm optimization algorithm to determine the optimal SVR parameters and improve model efficiency [15].
Deep learning has yielded promising results in many research areas and has attracted considerable academic and industry attention [16]. Deep learning has considerable potential for use in several applications, such as supervised (classification) and unsupervised learning (clustering) tasks, natural language processing, dimensionality reduction, biology, medicine, and motion modeling [17][18][19][20][21][22]. The advantage of deep learning is that it creates a mixture function set of numerous nonlinear transformations and useful expressions to produce more abstraction and consequently more benefits [17].
None of the aforementioned studies have involved considering a mixed-precision neural architecture to forecast container throughput. In this study, the univariate time-series forecasting method based on a mixed-precision neural architecture was proposed to forecast future container throughput.

Proposed Method
Among deep learning techniques, LSTM networks and CNNs are probably the most popular, efficient, and widely used [23]. These models are typically used for time series because the appropriate combination of different networks can achieve superior results. The LSTM model is inherently designed with a special storage mechanism that can effectively capture sequential schema information [24]. CNNs are superior at filtering data noise, extracting important features, and increasing the performance of the final prediction model. However, a typical CNN is suitable for processing spatially autocorrelated data but cannot correctly handle complex, long-term time dependence [17]. Therefore, considering the strengths of the two deep learning techniques, we proposed a novel mixed-precision architecture based on a CNN-LSTM hybrid neural network that can better handle nonlinear systems and enhance forecasting performance. The model architecture is presented in Figure 1. The proposed mixed-precision architecture primarily comprises an input layer, a CNN, an LSTM, and a fully connected layer. The input layer loads time-series data and transfers them to the CNN for the extraction of high-level features (the feature extractor in Figure 1). The output is then sent to an LSTM recurrent neural network (the regressor in Figure 1) and then to a fully connected layer, which generates the forecast for container throughput. The operations of the layers are briefly introduced in the subsequent sections. Given a sequence of numbers for a time-series data set, applying machine learning methods requires restructuring the time-series data as a supervised learning problem. Hence, the first layer of the network restructures the time-series data as a supervised learning problem by using the values at the previous time steps to predict the value at the next time step.

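Restructuring a univariate series into a supervised learning problem, where each window of past values predicts the next value, can be sketched as follows (a minimal example; the window size and names are illustrative, not the paper's exact settings):

```python
import numpy as np

def make_supervised(series, window=12):
    """Restructure a univariate series into (X, y) pairs: each sample
    uses `window` past values to predict the value at the next step."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

# 24 monthly observations with a 12-step lookback yield 12 samples
series = np.arange(24, dtype=float)
X, y = make_supervised(series, window=12)
print(X.shape, y.shape)  # (12, 12) (12,)
```

Each row of X is one sliding window of past observations, and the matching entry of y is the value the model must predict.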

CNN Layer
A classic CNN is composed of two specially designed data-preprocessing layers: the convolutional layer and the pooling layer. These layers filter the input data and extract useful information. In general, a fully connected network layer follows the CNN and receives the filtered data as its input. The goal of the convolution layer is to use convolution operations to process the input data and generate new feature values [16]. A convolution kernel (filter) is the unit used for the convolution operation. A filter can be considered a matrix-shaped area that slides over the input data. On convolution, the feature values specified by the coefficients and the size of the applied filter are obtained. By applying different filters to the input data, multiple convolved feature maps can be generated. These feature maps are typically more useful than the original input data and considerably improve the performance of the model.
In CNNs, the convolution layer is followed by a pooling layer. Pooling is a subsampling technique that extracts specific values from the feature maps and generates a lower-dimensional matrix. Similar to the convolution operation, the pooling layer uses a small sliding window, which takes the values within each region of the convolution features as input and outputs a new value. Typical choices are max pooling and average pooling, which output the maximum and average values of each moving range, respectively. The new matrix generated by the pooling layer can be considered a concise version of the convolution features generated by the convolution layer. Pooling makes the system more robust because small changes in the input do not change the output of the pooling layer.
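The convolution-then-pooling mechanics described above can be illustrated with a minimal NumPy sketch (the filter values and window sizes are illustrative, not the trained ones):

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1D convolution: slide the kernel over x and take dot products."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, size=2):
    """Non-overlapping max pooling: keep the maximum of each window."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
feature_map = conv1d(x, np.array([1.0, -1.0]))  # a size-2 filter highlighting changes
pooled = max_pool1d(feature_map, size=2)
print(feature_map, pooled)  # [-2. 1. -3. 1. -2.] [1. 1.]
```

The filter produces a feature map of local responses, and the pooling step halves its length while retaining the strongest responses.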

LSTM Layer
For effective extraction of the temporal relationships between time lags, an LSTM layer was established following the CNN layer. To address the vanishing gradient concerns of a recurrent neural network, the memory scheme in the LSTM network comprises three gate types: forget gates, input gates, and output gates [24]. The operation of the LSTM is illustrated in Figure 2, and Table 1 presents a summary of the parameters used in the LSTM layer.


Here, x_t is the input, h_{t−1} is the previous cell output, y_t is the output, C_{t−1} is the memory of the previous cell, h_t is the current cell output, C_t is the current cell memory, F_t is the forget gate, I_t is the input gate, and O_t is the output gate.
Table 1. Parameters used in the LSTM layer.

Weights and biases:
U_f, U_i, U_c, U_o: input weight matrices for the forget gate, input gate, memory cell, and output gate
W_f, W_i, W_c, W_o: recurrent weight matrices for the forget gate, input gate, memory cell, and output gate
b_f, b_i, b_c, b_o: input bias vectors for the forget gate, input gate, memory cell, and output gate
Gates:
F_t: forget gate, responsible for determining whether the old information is remembered
I_t: input gate, responsible for determining whether the input is adopted
O_t: output gate, responsible for determining whether the output is adopted
Other:
y: ground truth value
ŷ: predicted value
y_i: the 12-month container throughput data that must be predicted, where i = 1, 2, ..., 12
The weights and biases are the parameters that must be learned by the LSTM.

The gates, hidden outputs, and cell states can be represented as follows [25]:

F_t = σ(U_f x_t + W_f h_{t−1} + b_f) (1)
I_t = σ(U_i x_t + W_i h_{t−1} + b_i) (2)
O_t = σ(U_o x_t + W_o h_{t−1} + b_o) (3)
C̃_t = τ(U_c x_t + W_c h_{t−1} + b_c) (4)
C_t = F_t ⊗ C_{t−1} ⊕ I_t ⊗ C̃_t (5)
h_t = O_t ⊗ τ(C_t) (6)

The mean squared error is used as the loss function:

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² (7)

Here, σ is the sigmoid function, an activation function that converts an input value into a value between 0 and 1. Moreover, τ is the tanh function, which converts the output value to a value in the range of −1 to 1. These values determine which specific information should be excluded or retained. The terms x_t, h_t, and C_t denote the input value, hidden value, and cell state at time step t, respectively; the terms h_{t−1} and C_{t−1} are the hidden value and cell state at time step t − 1, respectively. The symbols ⊗ and ⊕ denote pointwise multiplication and addition, respectively. U_f, U_i, U_c, and U_o are the input weights; W_f, W_i, W_c, and W_o are the recurrent weights; and b_f, b_i, b_c, and b_o are the biases, which must be determined through learning during training.

The three stages of the LSTM that control the three gates are illustrated in Figure 2. First, for the previous node h_{t−1} and the input x_t, the content selected for forgetting is generated through (1) for the calculation of F_t, in which 0 represents complete erasure and 1 represents complete retention. Next, the system updates the cell state with (5). Eventually, the final output is confirmed using (3) and (6). The output of the LSTM is passed as the input of the fully connected layer, which then generates the final container throughput forecasts. The unfolded chain structure of the LSTM for an input flow [x_1, x_2, ..., x_k] is depicted in Figure 3, with h_j and c_j, where j = 1, 2, ..., k; h_j is the hidden state at time j and c_j is the cell state at time j. In the recurrent regression, the LSTM unit uses the previous state h_{j−1}, c_{j−1} to compute the container throughput y_j, where j = 1, 2, ..., k. In this method, past data are transferred recursively through the whole loop of the LSTM.
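A single LSTM step following the standard gate equations can be sketched in NumPy. The weights here are random placeholders and the sizes are illustrative; in the actual model these values are learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 1, 4  # univariate input, 4 hidden units (illustrative sizes)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid: gate values in (0, 1)
tau = np.tanh                               # tanh: values in (-1, 1)

# Input weights U, recurrent weights W, and biases b for each gate / the cell
U = {g: rng.normal(size=(n_hid, n_in)) for g in "fico"}
W = {g: rng.normal(size=(n_hid, n_hid)) for g in "fico"}
b = {g: np.zeros(n_hid) for g in "fico"}

def lstm_step(x_t, h_prev, C_prev):
    F = sigma(U["f"] @ x_t + W["f"] @ h_prev + b["f"])      # forget gate
    I = sigma(U["i"] @ x_t + W["i"] @ h_prev + b["i"])      # input gate
    C_tilde = tau(U["c"] @ x_t + W["c"] @ h_prev + b["c"])  # candidate memory
    O = sigma(U["o"] @ x_t + W["o"] @ h_prev + b["o"])      # output gate
    C = F * C_prev + I * C_tilde  # cell-state update
    h = O * tau(C)                # hidden output
    return h, C

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in [0.5, -0.2, 0.8]:  # a short input sequence, processed recursively
    h, C = lstm_step(np.array([x]), h, C)
print(h.shape)  # (4,)
```

Each step consumes the current input together with the previous hidden state and cell state, mirroring the unfolded chain structure of the LSTM.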


Baseline Models
This section introduces the baseline models, including statistical learning and machine learning methods.

Random Forest Regression (RFR)
RF is an ensemble method in machine learning for classification and regression problems [26]. A typical RF implementation incorporates a set of binary decision trees. Forecasting is conducted by majority voting among the trees (in classification) or by averaging their outputs (in regression). In this paper, container throughput forecasting is a regression problem; therefore, RFR was selected as a baseline model.

Linear Regression (LR)
LR is a standard statistical technique that enables researchers to investigate the relationship between multiple independent variables and the variance in a single dependent variable [27]. The aim of the model-fitting process in LR is to identify the coefficients W = (w_1, ..., w_n) that minimize the residual sum of squares over the observed samples in the data set.
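The least-squares objective can be illustrated with NumPy on synthetic data; np.linalg.lstsq returns the coefficients minimizing the residual sum of squares:

```python
import numpy as np

# Synthetic, noise-free data generated by y = 2*x1 + 3*x2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])

# lstsq finds W minimizing ||X W - y||^2 (the residual sum of squares)
W, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(W)  # approximately [2. 3.]
```

Because the synthetic data are noise-free, the fitted coefficients recover the generating weights exactly.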

Adaptive Boosting (AdaBoost)
In AdaBoost, a boosting method proposed by Freund and Schapire [28], the samples misclassified by the previous basic classifier are given greater weight, and the reweighted samples are used to train the next basic classifier. A new weak classifier is added in each round until the error rate is low enough to satisfy a predetermined threshold or until a predetermined maximum number of iterations is reached.

SVR
Vapnik et al. proposed support vector machines (SVMs) to address classification problems [29]. For decades, SVMs have delivered promising results in several research areas such as solar forecasting, ECG signal preprocessing, and information security [30]. Drucker proposed SVR as an SVM-based solution for regression problems. The basic principle in SVR is to calculate a nonlinear function that maps training data to a high-dimensional feature space [31].

Rolling Forecasting
Rolling forecasting is used as a forecasting scheme and for maintaining time order in time-series and machine learning models. In this scheme, a fixed window is used, and the window is updated with each newly predicted value. This process involves removing the oldest data and adding the newest so that the fixed window always retains the same amount of time-series data. First in/first out (FIFO) is the updating strategy in rolling forecasting; this type of strategy is also referred to as a continuous or recursive strategy. An example of FIFO for rolling forecasting is displayed in Figure 4.
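A minimal sketch of the FIFO rolling scheme follows. The placeholder predict function simply repeats the last value; any fitted model could take its place:

```python
from collections import deque

def rolling_forecast(history, horizon, window=12, predict=lambda w: w[-1]):
    """Keep a fixed-size window; after each forecast, the oldest value is
    dropped and the newly predicted value is appended (first in, first out)."""
    w = deque(history[-window:], maxlen=window)
    forecasts = []
    for _ in range(horizon):
        y_hat = predict(list(w))
        forecasts.append(y_hat)
        w.append(y_hat)  # maxlen evicts the oldest value automatically
    return forecasts

print(rolling_forecast(list(range(20)), horizon=3))  # [19, 19, 19]
```

The deque's maxlen enforces the fixed window size, so the window always holds the same amount of time-series data as the scheme requires.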



Parameter Settings
The predictive capability of a forecasting method depends on whether appropriate parameters are used to formalize the model. For the machine learning approaches, such as RFR, AdaBoost, and MLP, the default parameter settings in scikit-learn were adopted [32]. For the statistical learning method SVR, the radial basis function kernel was used. However, the initial parameters are crucial for the performance of SVR; therefore, a grid search method was used to determine appropriate settings for C and σ. The parameters were incremented exponentially; thus, the search spaces were set as C = [2^0, 2^1, ..., 2^22] and σ = [2^−10, 2^−9, ..., 2^0].
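The exponentially spaced search grid can be generated as follows; the scoring function here is a stand-in for the model's validation error, which is model-specific and omitted:

```python
import itertools

C_grid = [2.0 ** k for k in range(0, 23)]       # 2^0 .. 2^22 (23 values)
sigma_grid = [2.0 ** k for k in range(-10, 1)]  # 2^-10 .. 2^0 (11 values)

# Exhaustive grid search: evaluate every (C, sigma) pair and keep the best
def grid_search(score):  # `score` maps (C, sigma) -> validation error
    return min(itertools.product(C_grid, sigma_grid), key=lambda p: score(*p))

# Toy score favoring C = 2^4 and sigma = 2^-2, for illustration only
best = grid_search(lambda C, s: abs(C - 16.0) + abs(s - 0.25))
print(best)  # (16.0, 0.25)
```

The full grid contains 23 x 11 = 253 candidate pairs, so an exhaustive evaluation remains cheap for this search space.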
Evaluating search parameters for deep learning can be considerably expensive; therefore, the architecture and parameter settings were predetermined empirically. The CNN-LSTM model consists of two convolutional layers with 64 and 128 filters, respectively, each with a filter size of 2. They were followed by one max pooling layer with a size of 2, an LSTM layer of 100 units, a fully connected layer of 32 neurons, and a single neuron as the output layer. The model was trained using the Adam optimizer [33] and the ReLU activation function [34] for 1000 epochs with early stopping. Each data set was divided into a training set and a testing set. The training set, used for training the models, consisted of monthly data from 2001 through the end of 2018. The testing set, used for evaluating forecast accuracy, consisted of the monthly data for 2019.
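The chronological split over the 228 monthly observations (2001M1 to 2019M12) reduces to simple slicing; the values here are placeholders for the throughput series:

```python
# 19 years x 12 months = 228 observations; the last 12 months are held out
n_months = 228
series = list(range(n_months))  # placeholder for monthly throughput values

train = series[:-12]  # 2001M1 - 2018M12 (216 points) for model fitting
test = series[-12:]   # 2019M1 - 2019M12 (12 points) for accuracy evaluation
print(len(train), len(test))  # 216 12
```

Slicing by position preserves the time order, which matters for the rolling forecasting scheme described earlier.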

Forecasting Performance Criteria
Root mean square error (RMSE) and mean absolute percentage error (MAPE) are two well-known statistical measures. These measures were used in this study to assess the deviation of forecasted values from actual values. RMSE is obtained by averaging the squared residuals and then taking the square root. This indicator is more sensitive to outliers than MAPE. MAPE divides each error by the actual value, which introduces a skew: if the actual value at a certain time point is low but the error is large, the MAPE value is greatly affected. RMSE and MAPE are described in (10) and (11), respectively.

RMSE = sqrt((1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²) (10)

MAPE = (100%/N) Σ_{i=1}^{N} |y_i − ŷ_i| / |y_i| (11)
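RMSE and MAPE translate directly to code (the values below are illustrative):

```python
import math

def rmse(y, y_hat):
    """Square root of the mean squared residual."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)

y, y_hat = [100.0, 200.0, 400.0], [110.0, 190.0, 400.0]
print(rmse(y, y_hat), mape(y, y_hat))
```

Note how the same absolute error of 10 contributes 10% to MAPE at y = 100 but only 5% at y = 200, illustrating the skew described above.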
where y_i represents the actual value, ŷ_i the forecasted value, and N the sample size. Lower RMSE and MAPE values indicate higher forecasting accuracy. Mean absolute scaled error (MASE) is a scale-free error measure that compares each error with the average error of a baseline. The advantage of MASE is that it never yields an undefined or infinitely large value; hence, MASE is a suitable metric for intermittent demand series in which some periods have zero demand. MASE is described in (12).

MASE = ((1/J) Σ_{j=1}^{J} |e_j|) / ((1/(n − 1)) Σ_{t=2}^{n} |y_t − y_{t−1}|) (12)
where e_j = y_j − f_j is the forecast error for period j, J is the number of forecasts, y_j is the ground truth value, and f_j is the forecast value. The denominator is the mean absolute error of the one-step "naive forecast method" on the training set (t = 1, 2, ..., n) [35].
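A minimal MASE implementation following this definition (the values are illustrative):

```python
def mase(y_train, y_true, y_pred):
    """MASE: mean |forecast error| scaled by the in-sample MAE of the
    one-step naive forecast (predicting the previous value)."""
    naive_mae = sum(abs(y_train[t] - y_train[t - 1])
                    for t in range(1, len(y_train))) / (len(y_train) - 1)
    errors = [abs(a - b) for a, b in zip(y_true, y_pred)]
    return (sum(errors) / len(errors)) / naive_mae

y_train = [10.0, 12.0, 11.0, 13.0]  # naive MAE = (2 + 1 + 2) / 3
print(mase(y_train, [14.0, 15.0], [13.0, 16.0]))  # (1 + 1)/2 scaled: 0.6
```

A MASE below 1 indicates the forecasts beat the naive benchmark on average; above 1, they are worse.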
Symmetric mean absolute percentage error (SMAPE) is an accuracy metric based on percentage (or relative) errors. SMAPE is defined as follows [36]:

SMAPE = (1/N) Σ_{i=1}^{N} 2|y_i − ŷ_i| / (|y_i| + |ŷ_i|) (13)

where y_i represents the ground truth value, ŷ_i denotes the predicted value, and N denotes the sample size. The absolute difference between y_i and ŷ_i is multiplied by 2 and divided by the sum of the absolute values of the ground truth y_i and the predicted value ŷ_i. This value is summed over every fitted sample and then divided by N. Chen [36] used absolute values in the denominator of the SMAPE equation to avoid obtaining a negative value; this approach was also used by Hyndman and Koehler [37].
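A minimal SMAPE implementation (the values are illustrative):

```python
def smape(y, y_hat):
    """SMAPE: 2|y - y_hat| / (|y| + |y_hat|), averaged over all samples."""
    return sum(2.0 * abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(y, y_hat)) / len(y)

print(smape([100.0, 200.0], [100.0, 100.0]))  # (0 + 2/3) / 2
```

Because the denominator uses both the actual and predicted magnitudes, the metric is bounded and treats over- and under-forecasts symmetrically.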
Mean arctangent absolute percentage error (MAAPE) is used when the ground truth can be zero. MAAPE is defined as follows:

MAAPE = (1/N) Σ_{i=1}^{N} arctan(|(y_i − ŷ_i)/y_i|) (14)

where y_i represents the ground truth value, ŷ_i denotes the predicted value, and N denotes the sample size. Unlike the regular absolute percentage error (APE), the arctangent absolute error in MAAPE approaches π/2 when division by zero occurs. Mean bounded relative absolute error (MBRAE) is derived from two measures, relative absolute error (RAE) and bounded RAE (BRAE) [36]. RAE is defined as follows:

RAE_t = |e_t| / |e*_t| (15)

where e_t = y_t − ŷ_t, e*_t = y_t − ŷ*_t, and ŷ*_t denotes the forecast at time t obtained using the benchmark method. Because RAE has no upper bound, it can be excessively large or undefined when e*_t is small or equal to zero. This problem can be easily addressed through the addition of |e_t| to the denominator of the RAE equation, which yields bounded RAE (BRAE), defined as follows:

BRAE_t = |e_t| / (|e_t| + |e*_t|) (16)

Based on BRAE, the metric MBRAE can be defined as follows:

MBRAE = (1/N) Σ_{t=1}^{N} BRAE_t (17)

Each measure has its own characteristics; using multiple measures helps in understanding the performance of the model in all aspects.
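Minimal implementations of MAAPE and MBRAE following these definitions (the values are illustrative; the zero-division guards mirror the bounded behavior described above):

```python
import math

def maape(y, y_hat):
    """MAAPE: arctangent of the absolute percentage error; the arctangent
    stays bounded (tending to pi/2) when the ground truth is zero."""
    return sum(math.atan(abs((a - b) / a)) if a != 0 else math.pi / 2
               for a, b in zip(y, y_hat)) / len(y)

def mbrae(y, y_hat, y_bench):
    """MBRAE: mean of bounded relative absolute errors, comparing each
    error e_t with a benchmark error e*_t via |e| / (|e| + |e*|)."""
    bounded = []
    for a, b, c in zip(y, y_hat, y_bench):
        e, e_star = abs(a - b), abs(a - c)
        bounded.append(e / (e + e_star) if e + e_star > 0 else 0.0)
    return sum(bounded) / len(bounded)

print(maape([100.0, 0.0], [90.0, 10.0]))               # (atan(0.1) + pi/2) / 2
print(mbrae([10.0, 20.0], [11.0, 22.0], [12.0, 18.0]))  # (1/3 + 1/2) / 2
```

Both metrics remain finite for every input, which is exactly why the paper pairs them with RMSE and MAPE.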

Data Source
The container throughput data of five ports in Taiwan (Anping, Hualien, Kaohsiung, Taichung, and Suao) were used in the study. The seasonally adjusted monthly levels were retrieved from Taiwan International Ports Corporation. Because of limited data availability, the samples were collected from 2001M1 to 2019M12. Container throughput capacity is defined as the estimated total cargo that can be processed [2]. For containers, it can be expressed in twenty-foot equivalent unit (TEU) containers per period. The container throughput data are displayed in Figure 5. As illustrated in the figure, the container throughput series of the ports exhibited different time trends: Kaohsiung port exhibited a steady trend; Hualien and Suao ports revealed downward trends; and Anping port's throughput grew from 2001 to its peak in 2010 and then returned to approximately its 2001 level. Overall, each series exhibited varying degrees of variation, which hampered forecasting based on fundamentals. Table 2 presents the descriptive statistics of monthly container throughput for the five ports from 2001 to 2019; in all, 228 samples were collected for each port. The table displays the following information for each port: maximum (Max), minimum (Min), mean, median, first quartile (Q1), third quartile (Q3), interquartile range (IQR), standard deviation (SD), and coefficient of variation (CV, %). In Table 2, Anping exhibited the highest degree of variation in terms of its CV value (approximately 81.4%), whereas Taichung exhibited the lowest CV value (19.2%) of all ports, implying low variability in the container throughput of Taichung. These results revealed that the five ports exhibited different variations and trends. The throughput data are visualized in the form of box and whisker plots in Figure 6 to present an informative representation of the data distribution. In particular, the middle line of the box is the median.
The median splits the data set into a bottom half and a top half. The bottom line of the box is the median of the bottom half, or Q1, and the top line of the box represents the median of the top half, or Q3.
The whiskers (vertical lines) extend from the ends of the box to the minimum and maximum values, respectively. The IQR is defined as the distance between Q1 and Q3: IQR = Q3 − Q1. Data points are considered outliers if they fall more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3. The five ports, in order of container throughput, are Kaohsiung, Taichung, Hualien, Suao, and Anping. The maximum, minimum, Q1, Q3, mean, median, and outliers for each port are displayed in Figure 6.
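The descriptive statistics and the box-plot outlier rule described above can be sketched as follows. The helper names `describe` and `tukey_outliers` are illustrative, and Q1/Q3 are computed as the medians of the lower and upper halves of the sorted data, matching the description in the text (other quartile conventions exist).

```python
import statistics

def describe(series):
    """Descriptive statistics as in Table 2, including the coefficient of
    variation CV (%) = SD / mean * 100 (sample standard deviation)."""
    mean = statistics.mean(series)
    sd = statistics.stdev(series)
    return {
        "max": max(series), "min": min(series),
        "mean": mean, "median": statistics.median(series),
        "sd": sd,
        "cv_pct": sd / mean * 100,  # higher CV -> higher relative variability
    }

def quartiles(values):
    """Q1/Q3 as medians of the lower and upper halves (the overall median is
    excluded from both halves for odd-length data)."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower, upper = s[:half], s[half + n % 2:]
    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2
    return med(lower), med(upper)

def tukey_outliers(values):
    """Points beyond 1.5 * IQR below Q1 or above Q3, as in Figure 6."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```

For example, in the series [1, 2, 3, 4, 5, 6, 7, 8, 100], Q1 = 2.5, Q3 = 7.5, IQR = 5, and only the value 100 exceeds the upper fence of 15, so it is flagged as an outlier.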

Comparison of Models
The metrics MAPE, RMSE, MASE, SMAPE, MAAPE, and MBRAE were computed for each method to determine the predictive capabilities of the two classes of forecasting approaches. The results are listed in Table 3. CNN-LSTM demonstrated performance superior to that of the other methods.
The advantage of CNN-LSTM over time-series models is that CNN-LSTM does not require determining whether the data are stationary, nor does it require additional statistical tests. The characteristics of the training data were learned directly by CNN-LSTM. Furthermore, CNN-LSTM outperformed the other machine learning methods in handling forecasting problems in most cases. The overall forecasting performance of the various models is presented in Figure 7. The histogram in the figure displays the mean MAPE and mean RMSE for exports and imports in each model. The MAPE and RMSE averages of the CNN-LSTM model were lower than those of the other models, which indicates that the CNN-LSTM model outperformed the other models.
Diebold and Mariano [38] and Derrac et al. [39] revealed that the Wilcoxon signed-rank test [40] and the Friedman test [41] are reliable statistical benchmarks. These benchmarks have been widely applied in studies of machine learning models [42][43][44]. We used both methods to compare the forecasting accuracy of the applied CNN-LSTM with the performances of LR, AdaBoost, RFR, and SVR. Both statistical tests were implemented with a significance level of α = 0.05. The results in Table 3 revealed that the CNN-LSTM model provided significantly better forecasting performance than the other models. The outcomes obtained through metrics and statistical tests represent the performance of the different models. In Figure 8a, the x axis denotes the time (January-December 2019), and the y axis denotes container throughput, presented in thousands of twenty-foot equivalent units (TEU). The cross-trend line denotes the ground truth. Prediction outcomes of the different models are indicated as follows: CNN-LSTM (triangle), SVR (circle), RFR (plus), AdaBoost (rectangle), and LR (diamond). We visualized the forecasted and actual values to understand the difference between the predicted and actual trends.
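As a sketch of how such a comparison works, the Friedman test statistic can be computed from the per-dataset ranks of each model's errors and then compared against a chi-square distribution with k − 1 degrees of freedom. In practice one would use `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon`; the pure-Python version below is only illustrative and handles ties by average ranks.

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic for k models evaluated on N datasets.
    `scores` is an N x k matrix of errors (lower = better); tied values
    share the average of the ranks they span."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1  # extend the tie group
            avg = (i + j) / 2 + 1  # average rank for the tie group
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for m in range(k):
            rank_sums[m] += ranks[m]
    mean_ranks = [r / n for r in rank_sums]
    # chi^2_F = 12N / (k(k+1)) * sum(R_j^2) - 3N(k+1), R_j = mean rank of model j
    return (12 * n / (k * (k + 1))) * sum(r * r for r in mean_ranks) - 3 * n * (k + 1)
```

A large statistic indicates that at least one model ranks consistently better or worse than the others across the datasets, after which post hoc pairwise tests (such as the Wilcoxon signed-rank test) identify which pairs differ.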
The subfigures (a) Kaohsiung, (b) Hualien, (c) Anping, (d) Suao, and (e) Taichung highlight that the value predicted by CNN-LSTM (denoted by the red line) is the closest to the actual value and trend. Although the other methods yielded similar trends, they deviated considerably from the actual trend. The data distribution of the residuals (the deviation between the actual and predicted values) is visualized as a box and whisker plot (Figure 9) to distinguish the models' predictive performance. The box and whisker plot helps observe the data distribution and the distance between the median and zero in the residual distribution. A smaller box implies that the residuals are distributed in a more concentrated manner, and a median (middle line) closer to 0 on the y axis indicates a more accurate prediction by the model. In Figure 9, the residual distribution of CNN-LSTM (the box on the extreme right) is smaller and more concentrated than that of the other methods. Thus, the forecasting capability of the CNN-LSTM method is more stable.
Furthermore, the median of the CNN-LSTM model is closer to 0 in all cases except Kaohsiung, which indicates that the residuals of the CNN-LSTM model are also smaller than those of the other models in most cases. Therefore, the forecasting capability of the CNN-LSTM is more stable and robust than that of the other models.
In the model training process, a validation set can be used to evaluate whether model overfitting occurred. Generally, after the performance on the validation set becomes stable, if training continues, the loss on the training set will continue to improve, but the loss on the validation set will deteriorate, which indicates that the model is overfitting. Hence, the validation set can also be used to determine when to stop training. We considered 10% of the training set as the validation set. Figure 10 shows that the trends of the validation set of the container throughput data are stable according to the training result obtained using the proposed method. In most cases (Kaohsiung (Figure 10a), Hualien (Figure 10b), Suao (Figure 10d), and Taichung (Figure 10e)), the trend of the validation set was stable before 100 epochs. For Anping (Figure 10c), the trend of the validation set was stable at approximately 600 epochs.
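The stopping criterion described above (halt once the validation loss has plateaued) can be sketched as an early-stopping rule. The `patience` and `min_delta` parameters below are illustrative assumptions, not values reported in the study.

```python
def early_stopping(val_losses, patience=10, min_delta=0.0):
    """Return the epoch at which training should stop: when the validation
    loss has not improved by more than min_delta for `patience` epochs.
    The best weights observed so far would normally be restored."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch  # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch  # validation loss has plateaued; stop here
    return len(val_losses) - 1  # budget exhausted without triggering
```

Under this rule, a series such as Anping's, whose validation loss keeps improving for longer, naturally trains for more epochs (approximately 600 in Figure 10c) than the other ports (under 100 epochs).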
This study revealed that the mixed-precision model based on CNN and LSTM has a superior predictive capability for container throughput data. The experimental results revealed that the CNN and LSTM layers led to the superior performance of the mixed-precision model. Therefore, using a convolutional layer for the feature extraction of time-series data is effective, and modeling the time-dependent relationship through LSTM is a feasible solution. However, for throughput prediction, the method proposed in this study is limited to univariate time-series data. To address more complex multidimensional data, a more complex feature extraction architecture must be constructed.
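For context, framing a univariate series for such a model amounts to sliding a fixed-length window over the data, with each window serving as the model input and the next observation as the target. The helper below is a minimal sketch; the window length is a free parameter, and the value used in the study is not assumed here.

```python
def make_windows(series, window):
    """Turn a univariate series into (input window, next value) pairs,
    the supervised form consumed by a CNN-LSTM-style forecaster."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])  # `window` consecutive observations
        y.append(series[i + window])    # the value to forecast
    return X, y
```

A multivariate extension would widen each window from a vector to a matrix (time steps by variables), which is exactly where the more complex feature extraction architecture mentioned above becomes necessary.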
In the future, researchers should consider learning from the various convolution layers used in image processing, such as dilated convolution, depthwise separable convolution, and deconvolution, which might further improve the model performance.

Conclusions and Future Works
The forecasting of container throughput is a crucial issue in port operation, and there is still room for improvement in existing forecasting methods. A mixed-precision neural architecture for container throughput forecasting was proposed in this study. Our mixed-precision neural architecture combined the strengths of CNNs and LSTM. The matrix operations in the CNN-LSTM model combined the abstracted features from the CNN model with the time-dependence modeling of the LSTM model, yielding more comprehensive feature extraction over the overall data and enhanced forecasting performance compared with either the CNN model or the LSTM model alone. Our mixed-precision neural architecture offers the advantage of simultaneously combining feature extraction and precise modeling of time dependencies in the CNN-LSTM model. The use of multidimensional and high-frequency data in port development has received considerable interest recently. Future studies can further explore advances in neural network architectures to process multivariate time-series data and provide more comprehensive multistep throughput forecasting.