Deep Learning Forecasting for Supporting Terminal Operators in Port Business Development

: Accurate forecasts of containerised freight volumes are unquestionably important for port terminal operators to organise port operations and develop business plans. They are also relevant for port authorities, regulators, and governmental agencies dealing with transportation. In a time when deep learning is in the limelight, owing to a consistent strip of success stories, it is natural to apply it to the tasks of forecasting container throughput. Given the number of options, practitioners can beneﬁt from the lessons learned in applying deep learning models to the problem. Coherently, in this work, we devise a number of multivariate predictive models based on deep learning, analysing and assessing their performance to identify the architecture and set of hyperparameters that prove to be better suited to the task, also comparing the quality of the forecasts with seasonal autoregressive integrated moving average models. Furthermore, an innovative representation of seasonality is given by means of an embedding layer that produces a mapping in a latent space, with the parameters of such mapping being tuned using the quality of the predictions. Finally, we present some managerial implications, also putting into evidence the research limitations and future opportunities.


Introduction
In recent years, containerisation and ports have played important roles in international trade [1,2] for supporting freight traffics efficiently in worldwide competition [3,4].Today, nearly all overseas shipments of products such as furniture, toys, footwear, clothing, auto parts, electronic components, and computers are usually made through containers on cargo ships.Furthermore, the amount of food goods shipped in refrigerated containers is also constantly growing [1], and this process has greatly stimulated the business development of terminal operators [5].Therefore, shipping companies are using larger and larger ships that allow the transport of about 24,000 TEUs, minimizing unit transport costs and developing maritime economics of scale [6] to efficiently manage logistics activities both in port handling and in container storage on vessels through adoption of the so-called model of naval gigantism [7].
In this context, the importance of optimal management and planning of ports' development is unquestionable [8].In particular, construction costs represent sunk costs for a port, and to ensure that these investments in the development of ports will be profitable, the number and layout of the terminals [9,10] in the port must be appropriately commensurate with the volume of container transport in the port itself [11].It is therefore evident that, to avoid waste, port managers must be able to accurately predict the container transport volumes [12,13], as this is essential for planning and organizing port activity [14] and is also relevant for government departments dealing with transport at both the micro and macro levels [15].
The demand forecast for container traffic is a crucial success factor in the business development of port logistics which allows efficient management of resources by both container terminal operators (i.e., concessionaires) and port authorities.Moreover, the strategic management of these maritime logistic activities depends on port governance models [16] as well as on the level of vertical integration of container transport processes between terminal operators and shipping companies [17,18].In this perspective, port authorities must forecast-through some concession clauses and the setting of target performance-the achievement of economic and operational outcomes by terminal operators [19], measurable in terms of traffic demand, logistics effectiveness, and efficiency managed by the "landside".Therefore, container terminal operators manage and plan critical logistic activities in accordance with port authority marketing strategies [20,21] to support port competitiveness in the worldwide scenario [4,22].Technological advances-such as digitalisation and the Internet of Things-offer clear opportunities to streamline and synchronise operations, increase efficiency, and improve productivity toward the building of the so-called smart port [23,24].However, automation and port call optimisation involve digital investments, both for initial deployment and for keeping abreast of recent advances, and these investments are linked to traffic volume monitoring to support the port's competitive advantage [25].
Coherent with the aforesaid statements, this study comes at an opportune time when substantial variation can be observed in containerised shipment volumes, owing to factors such as trade tensions, protectionism, regionalisation of supply chains, the availability of mega container ships, and last but not least, the global pandemic [26].In a scenario filled with uncertainty, port managers have to carefully rethink their development plans.Regulators and port authorities are also facing the challenges of a changing environment, especially insofar as they grant container terminal concessions.Finally, there is a finegrained model capable of forecasting several categories of container traffic fits separately but simultaneously with the efforts to reduce carbon emissions due to shipping and the increased attention ports are giving to landside operations and logistics.

Related Work
Forecasting the container transport volume is a hot topic in the maritime transportation literature, and several approaches have been developed to improve the predictions.In this section, a brief review of the literature is conducted.
Given the time-dependent nature of the phenomenon, models typical of time series analysis have been applied.The autoregressive integrated moving average (ARIMA) [27] models represent the most common approach, and some interesting empirical studies that apply these models to container transport volume forecasting can be found in the work by Chan et al. [28].
Other authors have highlighted the presence of seasonal fluctuations in the container transport volume data [29,30] and suggested the use of models able to take these effects into account.In this direction, the Seasonal-ARIMA (SARIMA) models appeared to be the natural extension of the traditional ARIMA models.Schulze and Prinz [31] applied SARIMA models to predict the container transport volume in German ports.
The main limitation of the ARIMA and SARIMA models is the assumption of linearity, which could make them unable to capture the complexity present in some empirical evidence.The need for more accurate predictions directed scientific interest towards more complex and non-linear models.Some interesting examples are trigonometric regression models [30] and machine learning-based models, such as artificial neural networks (ANNs) [32] and support vector machines (SVMs) [33].Focusing on an intercontinental liner service, Wang and Meng [34] proposed a method for forecasting the number of containers to be transported between two continents.They used piecewise linear regression, autoregression, and artificial neural networks.Using real-case booking data, they found that a combination of the three methods yielded forecasts of satisfactory precision.To cope with changing trends and limited data, Chen and Liu [35] developed a predictive method based on a GM(1,1) grey model and generalised contour lines to forecast Chinese port cargo throughput.Fouries analysis was used in combination with a SARIMA model in an empirical study of cargo and container throughput for the ports of Hong Kong and Kaohsiung.Numerous other approaches, based on decomposition methods or hybrid models, have been proposed.The interested reader may refer to the classic book by Box et al. [27].
The mainstream literature, especially in the last period, has shown a growing interest in machine learning (ML).Barua et al. [36] provided an exhaustive review of the progress in ML for transportation management.The interesting point of forecasting with ANNs is that the optimal model is determined from the data itself.The first paper that employed ANNs in this field is dated 1999, when an ANN-based model was used to predict the volume of container trade in the port of Kaohsiung in Taiwan [37].More recent studies are proposed by Gosasang et al. [38], who analysed data relating to the port of Bangkok, and Milenković et al. [39], who focussed on data from the port of Barcelona.Working on data relating to the port of Shanghai and Shenzhen, Xie et al. [40] proposed a hybrid approach to predicting container trading volume based on the least squares SVR (a regression version of the traditional least squares SVM method) and offered an interesting comparative analysis with some models, including ARIMA and SARIMA models.
Notable new approaches include the one by Wang et al. [41], who discussed a generative model based on an autoencoder coupled with an ANN for forecasting the quality of production, the one by Zhou et al. [42], who incorporated dynamic factors into a recurrent network, and the one by Du et al. [43], who combined variational mode decomposition to isolate multiple seasonal components and an extreme learning machine.Another relatively recent paper proposes a univariate long short-term memory (LSTM) network model to forecast the future volume of container transport [44].LSTMs are a type of recurrent neural network [45] specifically designed to analyse data with sequential structures, and they have recently become extremely popular after the excellent results achieved in various fields such as handwriting and speech recognition and text translation [46,47].
This paper contributes to the literature in two directions.First, we explore two other ML-based algorithms specifically designed for processing sequential data: 1-D convolutional neural networks (CNNs) and gate recurrent unit networks (GRUs).We also introduce in the neural network architectures investigated an additional component that explicitly models the seasonal effect via embedding neural networks.Second, we develop a multivariate forecasting approach able to predict the container traffic not only at the macro level (i.e., the entire port) but also on a micro level (e.g., the specific terminal operator).

Methods
This section briefly describes the neural network (NN) blocks used in this research.In particular, neural networks represent a class of algorithms replicating the human brain's learning process.They are constituted by a set of interconnected units arranged in layers that learn from experiences using training algorithms.The way these neurons are connected (i.e., through the synapses) configures different types of neural networks with different characteristics and properties [48].NNs have been proven to be promising function approximators and are extensively used for complex forecasting tasks.Neural networks represent a promising tool for modelling the dynamics of container traffic volume, which is generally characterised by strong nonlinearity and seasonality effects.This is especially true when modern deep learning networks (DLN) designed explicitly for processing time-series data (such as recurrent and 1D-convolutional data) are considered.The mathematical descriptions of these kinds of neural networks, which were used in this research, are provided in the following subsections.

Fully Connected Neural Networks
In feedforward networks, the information propagates along the synapses from the input to the output layer in only one direction.Each layer transforms the data, and successive transformations are chained to produce a composition of functions.
We denote by x = (x 1 , x 2 , . . ., x d ) ∈ X the vector of features.Mathematically, a q ∈ N-dimensional feedforward (or fully connected) neural network layer can be seen as a vectorial function that maps the training samples in a new space.The coordinates of the data in the new space are the output of the neurons in the layer, and each unit computes the nonlinear function z i (x) as follows: where w i,l ∈ R are coefficients to learn and φ(•) denotes the activation function.The most popular choices of the activation function in the machine learning literature are the sigmoid, hyperbolic tangent, or rectified linear units, where the latter is known as a ramp function outside the machine learning parlance.In matrix form, the output z(x) = (z 1 (x), z 2 (x), . . ., z q (x)) ∈ R q can be also expressed as where If the network has a single layer, the new features z(x) are used directly to calculate the output.Otherwise, multiple layers are staked.The output of each layer represents the input of the next one and so on for the following layers.A network with several layers is called a deep network.Considering a deep network architecture with m ∈ N layers, the mechanism behind can be formalised as follows: where q k indicates the number of units in the k-th layer (with are the coefficients to optimise, and φ k (•) is the activation function applied by the k-th layer.The network learning is carried out by optimising the coefficient matrices and vectors according to a specific loss function.It poses the minimisation of a high-dimensional non-convex function, and the backpropagation algorithm is generally used.The coefficients are iteratively adjusted by minimising the error of the network outputs with respect to some reference values.
An extensive description of neural networks and backpropagation can be found in [32].

Embedding Neural Networks
Sometimes, information is available in the form of categorical variables.In the statistical literature, the standard procedures for dealing with categorical variables are one-hot and dummy encoding.Nevertheless, these coding schemes produce high-dimensional sparse vectors, which could present potential reliability problems in the obtained estimations when there are many categorical features or one of the variables presents many levels.Embedding is an innovative technique to analyse categorical variables that appeared for the first time in the natural language processing context [49].With embedding, the ability of the neural network to learn representations has been used to encode the levels of categorical variables into numerical vectors.The coordinates of the levels of the categorical variables in the new space are the network's parameters learned during the training process.Furthermore, the Euclidean distance in the new real-valued space of the levels indicates that they are similar concerning the target variable.
Let B = {b 1 , b 2 , . . ., b n b } be a discrete set of categories.An embedding layer is a function where q b ∈ N is a hyperparameter defined by the modeller describing the size of each embedding layer.This function is equivalent to applying different q b -dimensional FCN layers to the levels of the categorical variable after one-hot encoding.In this setting, the output of each neuron will be equal to the weight multiplied by the only element equal to one.Consequently, they can be trained in the same way as the other network parameters related to the other layers [50].The output of these layers can be seen as the coordinates of the levels of the categorical variables in the new q b -dimensional space.Furthermore, the Euclidean distance in the new space of the levels indicates that they are similar concerning the target variable.
Denoting with n b = |B| the number of levels of the categorical feature, the number of weights that must be learned in the embedding layer during training is n b × q b .

Recurrent Neural Networks
Recurrent neural networks (RNNs) provide additional synapses that connect neurons cyclically [45].The result of the previous elaborations is reprocessed as input in the following sessions.In this way, previsions are influenced by what has been learnt from the past.The recurrent nature of these architectures makes them powerful tools for processing sequential data [51,52].
Let x (t) ∈ R d , 0 ≤ t ≤ T be multivariate time series.Then, the output of an RNN layer with q ∈ N can be formulated as: where U = {u j,k } 1≤j,k≤q ∈ R q×q represents the weights associated with the recurrent synapses (with an initial value z (0) = 0 q ).The RNN training is carried out using backpropagation thorugh time, which generally converges slowly and has vanishing gradient issues.
In order to overcome these problems, two more sophisticated cell-based architectures were proposed: long short-term memory (LSTM) and gated recurrent unit (GRU) networks.

Long Short-Term Networks
Long short-term memory (LSTM) networks, one of the earliest approaches to RNN architectures with nontrivial state control [46], in addition to the recurrent synapses containing the short-term memory, has an additional memory cell that stores and releases the long-term information through a system of subnets called gates.Combining these two types of memory, LSTMs have achieved excellent results in several fields.Greff et al. [53] offered an exhaustive overview of LSTM architectures.
Mathematically, denoting with W (p) ∈ R q×d , U (p) ∈ R q×q , and w (p) 0 ∈ R q the weights associated to each subnet p ∈ {i, o, f , z}, the output of the LSTM layer with q ∈ N can be described as follows: (3) where σ(•) : R → (0, 1) and tanh(•) : R → (−1, 1) are the sigmoid and the tanh activation function, respectively, and c (t) is the state memory cell at time t (with an initial value c (0) = 0 q ).The forget input and output gates regulate the mechanism of storing and releasing the information.Chronologically, the forget gate is the first gate to act on the memory cell (Equation ( 5)).It uses the sigmoid function to delete a percentage of information considered obsolete.Next, the input gate (Equation ( 3)) adds to the output of the forget gate (Equation ( 7)) the new useful information extracted from the data input.Finally, the state of the memory cell c (t) is combined with the output gate (Equation ( 6)) to compute the output value of the LSTM cell (Equation ( 8)).
The number of weights in an LSTM layer with q units is equal to 4 × q × (d + q + 1).A graphical illustration of the LSTM cell is showed on the left side of Figure 1.

Gated Recurrent Unit Networks
Gated recurrent unit (GRU) networks [54] seem to be one of the most promising alternatives to LSTM.Being recently developed, their architecture does not provide a memory cell but includes two gates called the update gate and reset gate, which govern the flow of information.In fact, GRU represents a simplified version of LSTM and allows avoiding the gradient vanishing using a lower number of parameters.Using the previously adopted notation, the number of weights in a GRU layer with q units equal to 3 × q × (d + q + 1) is lower than an LSTM layer.
Let W (p) ∈ R q×d , U (p) ∈ R q×q , and w (p) 0 ∈ R q be the weights indexed by p ∈ {o, r, c}.The output of the q-dimensional GRU layer can be formalised as follows: The reset gate (Equation ( 10)), using the sigmoid activation function, determines when the hidden state has to ignore the previous hidden state and uses the input data (Equation ( 11)).
The update gate (Equation ( 9)) helps the RNN store long-term information, determining how much of the previous memory should be kept.In fact, it realises a task similar to that performed by the memory cell in the LSTM network.Finally, the output of the GRU cell is computed by combining the outputs of both gates (Equation ( 12)).Generally, a GRU layer contains several units, each having separate reset and update gates.This mechanism is used to learn the dependencies over different time scales.Some units will tend to have reset gates that are frequently active and therefore learning shortterm dependencies, while other units will have most active update gates, and so they will be able to capture longer-term dependencies.
The number of weights in a GRU layer with q units equal to 3 × q × (d + q + 1) is lower compared with the number of parameters in an LSTM layer with the same number of units.A graphical representation of the GRU cell is shown on the right side of Figure 1.

Convolutional Neural Networks
Convolutional neural networks (CNN)s, proposed in [55], are feedforward NNs that make use of the mathematical operator called convolution to extract new feature maps from data.They work by multiplying subsets of the input data matrix by another matrix, called a filter, to extract useful latent features.CNNs typically process multidimensional arrays, and the number of dimensions to which the convolution operator is applied distinguishes 1D, 2D, and 3D convolutional networks.
The main characteristics that differentiate CNNs from the traditional fully connected ones are basically twofold [32].The first is local connectivity, where nodes are only connected to a smaller region of the input, and each component of the feature maps is obtained by processing only a subset of the input data.The second is weights sharing, where each member of the kernel is used at every position of the input, meaning that all the nodes in the output detect exactly the same pattern.
Let x (t) ∈ R d , 0 ≤ t ≤ T be a multivariate time series of input variables.A 1D-CNN layer with filters W (j) = (w j,1 , w j,2 , . . ., w j,m ) ∈ R d×m and biases w j,0 ∈ R for j = 1, . . ., q extracts a new feature map {z s,j (x)} 0≤s≤T+1−m,1≤j≤q from the input data, where each component is a nonlinear function of some input data and is given by where * indicates the convolution operator and •, • denotes the scalar product in R d .It must be remarked that the index j refers to the different filters.Each filter W (j) produces a set of new features.The second index s refers to the different features extracted from a given filter.

The Model
This section describes the neural network models developed.Let S = {s 1 , s 2 , . . .} be the ordered set of the different seasons considered (i.e., quarterly data have |S| = 4 and monthly data |S| = 12) and Y = {y 1 , y 2 , . . .} be the set of years under investigation.We denote by r (t) ∈ R l the multivariate time series of the container traffic volume with a given level of detail at time t.The index t is supposed to vary in T = {t ∈ N : t ≤ |Y × S|} such that each integer t ∈ T uniquely identifies a pair (y, s) ∈ Y × S according to the chronological order of the seasons and the years.
We modelled the function f (•) using neural networks since they are known to be universal approximators [56].Our neural network architecture combines some of the network architecture described in Section 3.1.
The season s is modelled using embedding networks.When choosing an embedding size q s ∈ N, an embedding layer maps where z ∫ (s) is a set of new features that summarises the information concerning the seasonal effect and represents s in a way that is optimised with respect to the desired output in a q s real-valued dimensional space.Information related to the previous |S| time lags of r (t) is processed by neural networks layers specifically designed to process sequential data such as LSTM, GRU, or CNN layers.For illustrative purposes, we will consider the LSTM networks, but an interesting comparison among these different neural networks architecture will be carried out in the numerical experiments section.Let R (t) = {r (t 0 ) } t−|S |≤t 0 <t ∈ R l×|S | be the matrix that contains the previous |S| observations of r (t) .An LSTM layer with q r ∈ N units processes it as illustrated in equations R (t) , and the output is z r (r) = f (R (t) ).
The set of new features z r (r) ∈ R q r summaries the information related to the container traffic volume of the previous |S| time lags.Let z = (z s , z r ) ∈ R q s +q r be the full set of features obtained concatenating the output of the embedding and LSTM layer.The quantity of interest r (t) can be calculated by processing z with a fully connected layer: The embedding weights (z s , ∀s ∈ S), the LSTM weights (W (p) , U (p) , w 0 , ∀p ∈ {r, z, c}), and the weights of the fully connected layer (b 0 and W) must be properly optimised in order to minimise the error of the network outputs with respect to the reference values.A graphical representation of the proposed neural network model is provided in Figure 2.

Data
In order to validate the effectiveness of the proposed methodology, we carried out some experiments on data downloaded from the website of the port of Barcelona.These data were also used in [39].This port is one of the most important maritime logistics ports in the Mediterranean, with a land area of 10.653 square kilometers.Barcelona is the third largest port for container activity in Spain (after Valencia and Algeciras) and ninth in Europe.The port is managed by the Barcelona Port Authority.As one of the main ports in Spain-where it adopted the landlord governance model [16]-the Port of Barcelona is coordinated by Ports of the State, a state-owned organisation responsible for the meta-management of the 28 main Spanish port authorities that are in national seaports of general interest.In the port of Barcelona, there are two terminal operators specialised in container handling: "APM Terminals Barcelona" and "BEST-Barcelona Europe South Terminal".Both these container operators manage international terminals-linked with rail facilities-with more than 3000 m of berthing line and 17 container cranes.First, explanatory data analysis was carried out.Table 1 reports some statistics (mean, coefficient of variation (CoV), and range) of our dataset.We observe that the time series under analysis presented very different magnitudes.For example, the volume of full unloaded domestic containers was the smallest, while the volume of containers in transit was the largest.Figure 3 provides pairwise scatter plots (below the main diagonal) and Pearson correlation coefficients (above the main diagonal) among the time series of interest, arranged in matrix form.From both, we noted moderate positive correlations.The 3 stars next to the correlation coefficients indicate that the corresponding p-values were less than 0.001.This evidence suggests that a multivariate model could exploit the information among the different series and produce more accurate predictions.
Figure 4 shows the box plots for the different container types grouped by month.The red boxes refer to the warm months, while the blue ones represent the cold months.Some common elements can be observed among the different series.First, we observed that container traffic increased in the warm weather months, but this effect was more pronounced in some series and less so in others.q q q q q q q q q q q full_transit empty full_loaded_foreign full_loaded_domestic full_unloaded_foreign full_unloaded_domestic

Results
In this section, we present the results of some numerical experiments carried out on the data described in Section 3.3.A comparative analysis of the out-of-sample accuracy of some models in predicting the container traffic volume is carried out.The aim is to find the model able to more accurately forecast, out of sample, the container traffic volumes.We evaluated and compared the out-of-sample performance of the different methods using the mean absolute percentage error (MAPE): The comparison includes the standard SARIMA models, the deep learning model described in Section 3 based on LSTM and embedding networks, and two variants that use GRU and 1D-convolutional networks (in place of the LSTM network) to process the time-specific component.
Let T be the full set of the time points available.We considered two different forecasting scenarios:

•
In the first scenario, the models were calibrated using data up to December 2018, and forecasts were made for the calendar year 2019.In other words, data were partitioned in the training set, including data related to T (1) train = {t ∈ T : t ≤ 108} (from January 2010 to December 2018) and a testing set T (1) test = {t ∈ T : 108 < t ≤ 120} (from January 2019 to December 2019) for measuring the out-of-sample performance.

•
In the second scenario, the models were calibrated using data up to December 2019, and forecasts were made for the calendar year 2020.In this case, we defined the training set and testing set as T (2) train = {t ∈ T : t ≤ 120} (from January 2010 to December 2019) and T (2) test = {t ∈ T : 120 < t ≤ 132} (from January 2020 to December 2020), respectively.
We treated these two years separately since 2020 presented very different features from other years.
First, we discuss the calibration of all the models investigated.As far as the SARIMA models are concerned, in order to find the most suitable model specification, we used the "auto.arima"algorithm available in the R package "forecast" [57].The "auto.arima" algorithm explores the ARIMA model space, selecting a recommended model based on the Akaike information criterion (AIC) and unit root tests.Table 2 lists the models selected by the "auto.arima"algorithm for the different time series and for both forecasting scenarios.The autoregressive (AR), differentiation (I), and moving average (MA) orders for nonseasonal and seasonal components (denoted by the lower index 12) are reported with the resulting AIC quality index.As expected, in almost all cases, the algorithm selected models that provided seasonal components.Only for the full unlodead domestic data in the 2019 forecasts did the best model results have only non-seasonal terms.
Regarding the deep learning models, some details must be provided.According to machine learning best practices, preliminary pre-processing is performed.The time-specific data concerning the volume of container traffic were scaled to [0, 1] via min-max scaling, while the month names were coded into integer values.We considered the minimum and maximum values in the training set in the min-max scaling.Furthermore, the minimum value were reduced by 10% of the variation range, and the same percentage increases were applied to the maximum value to ensure the testing set data were in [0, 1] after the scaling.In all the architectures investigated, we set the size of the month embedding layer to q s = 5.For the LSTM (GRU) model, we used an LSTM (GRU) layer with 64 units, while in the case of the CONV model, we used 64 filters of a 6 × 6 size.In the latter case, the filter matrices slid along the data matrix backwards along the time dimension.The filter weights were multiplied by the input data matrix (lag values of the container traffic volume) and combined in a non-linear fashion (via an activation function) to extract the optimal features with respect to the quantity of interest.In addition, three different activation functions were considered to process the time-specific component:
The fitting of the neural networks was performed using the Adam optimiser [58], setting the parameters to the default values.The network weights were recursively adjusted, aiming to minimise the mean absolute error between the predicted and actual values.To avoid overfitting, we used leave-out validation by using 10% of the training set as the validation set.The update of the weights was calculated using the remaining 90% of the data, and at each epoch, the loss function was evaluated on the training and validation sets.To avoid overfitting, we saved the weight configuration that produced the lowest value of the loss function on the validation set.This was a set of data that the neural network did not use during training.The performance obtained on this set could be a reliable estimate of the model's forecasting performance.A review of other possible validation techniques can be found in [59].The results of training a neural network are sensitive to some components of the training process: the starting point of the optimisation algorithm, the random sampling of batches of training data employed to calculate the network weight updates, and so on.Then, analysing the results of a single run could be not sufficient to make a judgment.We fit each architecture 10 times and compared the performance on the testing set in terms of accuracy and stability.
Figure 5 shows the box plots of the outsample MAPE for all three networks (LSTM, GRU, and CONV) for the different activation functions.This analysis shows the sensitivity of the different models concerning the starting point of the optimisation algorithm and the random sampling of batches.Some evidence may be noted.First, we observed that the tanh activation (blue boxes) function produced the best performance for all the network architectures investigated.The linear function (red boxes) was the runner-up, while the relu activation (green boxes) produced the lowest out-ofsample accuracy in our tests.Analysis of the variability of the results confirmed this ranking: the tanh activation produced less variable results except for the GRU model, while relu activation was associated with less stable results.According to these results, we can conclude that the tanh activation function appeared to be the best activation function for the investigated phenomenon.
Table 3 compares the results of the SARIMA models against the LSTM, CONV, and GRU models with tanh activation in both forecasting scenarios.For the deep learning models, we considered the ensemble prediction, obtained as the average of the forecasts in 10 different runs.The MAPEs obtained by the different methods were reported at the single time series level and at the aggregate level, where all values were summed.First, we observed that the errors increased dramatically when moving from the 2019 forecasting scenario to the 2020 forecasts.This is quite reasonable, since container traffic volume showed a highly irregular pattern over the last year due to the COVID-19 outbreak.In addition, it is visible that the LSTM and GRU methods produced a lower global MAPE than the SARIMA models in both forecasting scenarios.On the contrary, the CONV model had poorer performances.This probably suggests that the evolution of container traffic can be better modelled by using recurrent neural network architectures than convolutional ones.Considering both scenarios, the GRU model appeared to be the best model.A further comparison between the performances of the GRU and SARIMA models is depicted in Figure 6 using barplots.Looking at the performance on individual series, in the 2019 forecasts, we observed that the improvement induced by the GRU models was especially large for the "full transit" and "empty" container volumes, which were two of the largest series in terms of size.In particular, the training of the deep learning models was carried out simultaneously on all data and pursued the minimisation of the global MAE in the training set.In this setting, it is not surprising that the learning algorithm put more weight on the larger series, which could produce larger residues, than the smaller ones.At the same time, the SARIMA approach appeared to be highly competitive on both "full loaded" series.The comparison between the forecasts obtained by the GRU and SARIMA models in this forecasting scenario is also graphically represented in Figure 7.In the 2020 forecasts, the GRU and LSTM models still overperformed the SARIMA model at the overall level, but the improvement appeared to be smaller.In this scenario, the only case in which the SARIMA approach yielded a lower MAPE than the GRU was the "full transit" container traffic volume.This evidence is in contrast with that observed in the first forecasting scenario.However, when analysing in greater detail the behaviour of the "full transit" container volume traffic, we observed that this series presented a large downward peak in the initial months of 2020 and a rebound in the summer months of 2020.All of the models investigated failed to predict the evolution of the "full transit" series, and this is not surprising given that these troughs were probably due to the COVID-19 outbreak.The reasons why the effects of the pandemic on the predictive ability of models were large in this particular series have yet to be clarified, and we plan to shed some light on this theme in a forthcoming work.Figure 8 shows the standardised residuals obtained by the SARIMA and GRU model, distinguished by month for all the time-series data.We observed that both models presented positive and negative residuals in all months.This evidence suggests that the SARIMA and the GRU model did not have a systematic error component.Furthermore, we note that the amplitude of the errors was generally stable over the different months.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q −0.We conclude this paragraph by analysing the representation of the seasonal components learned by the network during the training process.We focus on the GRU model in this part.The embedding layer, as described in Section 3.2, mapped the elements of the set S (the months of the year) in a q s = 5-dimensional space, in which they were represented in an optimal way with respect to the outcome variable.The Euclidean distance of the levels in the new space is a measure of their similarity with respect to the target variable.In our case, the distance measured the similarity of the months to the container traffic volume.Since the visualisation of a month embedded in a five-dimensional space was complex, we performed principal component analysis (PCA) on the embedding features to reduce the dimensionality of the data with a minimum loss of information.In particular, we extracted the first two (of five) components, and the months were mapped into a new two-dimensional space R 5 → R 2 .In our numerical experiments, the first two principal components explained about 95% of the total variability and represented the months of the year in a new bivariate space that allowed a simple graphical representation.Figure 9 plots the months of the year in this reduced-size space.We observed that the months characterised by similar (average) temperatures tended to be close to each other in the new space.In particular, the warm weather months (in red) were concentrated at the new space's bottom, while the cold weather months (in blue) were located in the remaining part of the new space.This suggests that the network recognises similar patterns in data referring to times of the year characterised by a similar climate and assigned similar weights to months with similar characteristics.q q q q q q q q q q −0.05

Conclusions
The availability of multivariate predictive models for container throughput with a breakdown for domestic and foreign freight will be beneficial for ports, shippers, and terminal operators, as well as for national regulators and authorities.Bearing in mind the increased variability and the quick evolution of the scenario, further accelerated as various interventions are implemented to mitigate the pandemic and its fallouts, adding forecasting models based on neural networks to the arsenal of tools available has not only the potential to yield better estimates but to complete and reduce the variation when used in combination with other methods.The ability to rapidly identify trends will have effects on directing efforts towards building capacity, and this applies equally to human resources training and recruitment.Therefore, the adoption of deep learning forecasting models turns out to be useful not only for the container terminal operators (e.g., "APM Terminals Barcelona" and "BEST-Barcelona Europe South Terminal") but also for the public organisations responsible for port management at the national level (e.g., Ports of the State) and local level (e.g., Barcelona Port Authority).In fact, the demand estimation for maritime container services allows an efficient management of ports from the perspective of global competitiveness.Moreover, a correct estimate of these data allows one to strategically sustain the stable development of vertical alliances between container terminal operators and shipping companies.
This paper sheds some light on the potential of predictive models based on neural networks to forecast the volume of containerised freight.The two recurrent neural network models outperformed SARIMA and the CNN in both of the forecasting scenarios analysed, achieving an absolute reduction in the MAPE-for the aggregate data-of about 3 percentage points in the first scenario and 1.5 percentage points in the second.The embedding realised by a neural network showed a clear spatial separation between the warmer and cooler months.The proposed approach can be extended, keeping in mind external regressors so that additional available information can be integrated.Future work will also include a refinement of the embedding used to build a meaningful representation of seasonality so that additional insights can be gained, as well as an investigation of hierarchical models.

Figure 1 .
Figure 1.A pictorial representation of the LSTM unit (left) and GRU unit (right).

Figure 2 .
Figure 2. A graphical diagram of the proposed neural network model.The website of the Port of Barcelona provides multivariate time series data related to container traffic.The dataset includes monthly observations from January 2010 to December 2020 (|S | = 12, |Y | = 11, and T = {t ∈ N : t ≤ 132}) related to the count of containers, which are broken down into six categories: loaded and unloaded foreign, loaded and unloaded domestic, empty, and transit.A separate analysis of the different categories is also especially interesting in light of the reshaping of global supply chains, with a thrust toward regionalisation that is being amplified by attempts to build resilience in response to the COVID-19 outbreak.First, explanatory data analysis was carried out.Table1reports some statistics (mean, coefficient of variation (CoV), and range) of our dataset.We observe that the time series under analysis presented very different magnitudes.For example, the volume of full unloaded domestic containers was the smallest, while the volume of containers in transit was the largest.

Figure 3 .
Figure 3. Pairwise scatter plots of the container traffic volume data.Correlation coefficients are marked with *** if the p-value of them being zero is less than 0.001.

Figure 4 .
Figure 4. Seasonal box plotsof the container traffic volume data.

Figure 6 .
Figure 6.Performance of the SARIMA and GRU models expressed as MAPE for forecasting years 2019 (lighter) and 2020 (darker).

Figure 7 .
Figure 7. Time series predictions of GRU and SARIMA models.

Figure 9 .
Figure 9. Graphical representation of the months embedding z s (s) ∈ R 5 after the dimensionality reduction (from 5 to 2) via principal component analysis.
Standardised residuals of the SARIMA and GRU models for the different months.