Article

A Novel Forecasting System with Data Preprocessing and Machine Learning for Containerized Freight Market

1 College of Civil Engineering and Architecture, Henan University of Technology, Zhengzhou 450001, China
2 College of Civil Engineering and Environment, Zhengzhou University of Aeronautics, Zhengzhou 450015, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(10), 1695; https://doi.org/10.3390/math13101695
Submission received: 8 April 2025 / Revised: 12 May 2025 / Accepted: 16 May 2025 / Published: 21 May 2025

Abstract

The Shanghai Containerized Freight Index (SCFI) and Ningbo Containerized Freight Index (NCFI) serve as crucial indicators for management and decision-making in China’s shipping industry. This study proposes a novel real-time rolling decomposition forecasting system integrating multiple influencing factors. The framework consists of two core modules: data preprocessing and prediction. In the data preprocessing stage, the Hampel filter is utilized to filter and revise each raw containerized freight index dataset, eliminating the adverse effects of outliers. Additionally, the variational mode decomposition (VMD) technique is employed to decompose the time series in a rolling manner, effectively avoiding data leakage while extracting significant features. In the forecasting stage, the cheetah optimization algorithm (COA) optimizes the key parameters of the extreme gradient boosting (XGBoost) model, enhancing forecasting accuracy. The empirical analysis based on SCFI and NCFI data reveals that historical pricing serves as a critical determinant, with our integrated model demonstrating superior performance compared to existing methodologies. These findings substantiate the model’s robust generalization capability and operational efficiency across diverse shipping markets, highlighting its potential value for managerial decision-making in maritime industry practices.

1. Introduction

Container transportation, a crucial pillar of global trade, plays a central role in moving goods across oceanic routes and is vital for maintaining the smooth operation of global supply chains [1]. Given the significant proportion of container shipping costs in overall maritime expenditures, the accurate forecasting of future freight rate fluctuations has become essential for international maritime participants to formulate effective strategies [2]. However, in recent years, market volatility, economic downturns, and persistent uncertainties have led to abnormal fluctuations in container freight rates. Coupled with the influences of global economic conditions and seasonal factors, the global shipping market has become increasingly unstable [3]. Such sharp price fluctuations not only affect the profitability of the container shipping sector but also escalate commercial risks and heighten the survival pressures faced by industry participants. Therefore, precise forecasting of container freight trends is essential for effective risk management and informed decision-making. The inherent complexity of price time series, characterized by non-stationarity, nonlinearity, high volatility, and structural breakpoints, further complicates the achievement of precise predictions [4].
In the field of container freight index forecasting, common research methods include traditional econometric models, artificial intelligence models, and their combinations. Table 1 shows the relevant forecasting studies in this field. Traditional econometric models, such as the Holt–Winters method for triple exponential smoothing [5], the seasonal autoregressive integrated moving average (SARIMA) [6], and the generalized autoregressive conditional heteroskedasticity (GARCH) model [7], remain classic tools for time series forecasting. Koyuncu et al. [8] utilized the SARIMA model to predict the impact of COVID-19 on container throughput indices, revealing that container throughput would continue to decline three months after the outbreak. Liu et al. [9] found that combining the autoregressive (AR) model with the GARCH model exhibited stable predictive performance during the financial crisis and under recent market conditions in their study of BDI. Furthermore, the research conducted by Bildirici et al. [10] introduced a combination of smooth transition autoregressive (STAR) models with GARCH models, providing new tools for addressing economic time series characterized by complex nonlinearities and volatility. However, for nonlinear time series data, relying solely on econometric models can limit the extraction of valuable information, thus constraining forecasting capabilities [11].
Due to the inability of econometric models to fully capture the nonlinear characteristics of data, there is a growing interest in nonlinear forecasting methods. Such methods primarily include machine learning and deep learning approaches [22,23]. Xue et al. [24] compared the performance of ARIMA, backpropagation neural networks (BP), and extreme learning machine (ELM) models under different time scales and forecasting scenarios, suggesting that neural network algorithms can significantly enhance predictive performance in long-term forecasts. Additionally, Katris et al. [25] effectively captured the hidden nonlinear features in BDI forecasts using machine learning models. They explored the applicability of these models across different periods by comparing ARIMA, fractional ARIMA (FARIMA), and GARCH models with machine learning models such as support vector regression (SVR). Further research by Shih et al. [26] demonstrated that integrating long short-term memory (LSTM) deep learning models with the Shapley value theory from cooperative game theory improved the accuracy of predictions for CCFI. However, single machine learning models exhibit limitations in exploring the complexity, nonlinearity, and multiscale characteristics of time series data [27], often leading to local minima or overfitting during the forecasting process.
To address the complexities of data arising from multiple factors, as well as issues related to model overfitting and underfitting, researchers have begun to combine various models [28] or integrate decomposition, optimization, and forecasting models [29] to construct hybrid forecasting models that enhance the robustness and stability of predictions. Typical decomposition algorithms include wavelet transform (WT) [14], variational mode decomposition (VMD) [30], empirical mode decomposition (EMD) [31], and complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) [13]. For instance, Chen et al. [32] developed a hybrid model combining EMD, gray fluctuation, and ARIMA for the analysis and prediction of CCFI, which effectively captured the complex nonlinear characteristics of the CCFI time series. Although the classical EMD has led to acceptable results in many studies, it may need some improvements [33]. Accordingly, Bagherzadeh and Asadi [34] demonstrated the significant effectiveness of an empirical mode decomposition algorithm enhanced by multi-objective optimization for aircraft ice detection. Additionally, Zhang et al. [35] proposed a hybrid carbon price forecasting model based on CEEMDAN and a window-based XGBoost method, demonstrating that the CEEMDAN-XGBoost model consistently outperforms the EEMD-XGBoost approach.
In many previous studies, the prediction of shipping price indices primarily relied on historical time series data. However, considering the characteristics of the shipping market as a complex system, its actual operations are influenced by various factors, including global supply chain dynamics [20], financial markets [36], core fuel prices [37], and uncertainties in economic development [38,39]. Inglada-Pérez et al. [40] systematically applied various chaos detection tools and methods to conduct an in-depth analysis of the dry bulk shipping market, revealing evident chaotic behaviors within the market. Bae and Chou [21,41] introduced macroeconomic variables and global economic indices to assess their impact on freight price indices when predicting the BDI, thereby enriching the dimensions of their predictive models. In another study, Yang et al. [42] improved the accuracy of future freight predictions by combining artificial neural networks with forward freight agreement (FFA) data. Additionally, Tsioumas et al. [43] confirmed the causal relationship between certain commodity prices and dry bulk freight rates. Papadimitriou et al. [16] established a new composite index that integrates China’s steel production, dry bulk fleet development, and the Dry Bulk Economic Climate Index (DBECI), significantly enhancing the predictive accuracy of the BDI. To address the issue of traditional methods’ excessive reliance on historical data and to improve responsiveness to sudden market changes, Liu et al. [12] proposed a forecasting model that integrates Baidu Search Index and event-driven factors, aiming to comprehensively assess the impact of these elements on freight volume predictions. Tsouknidis [44] focused on the volatility spillover effects between different segments of the shipping market, revealing significant and time-varying mechanisms of mutual price fluctuation influences among shipping markets. This indicates that the volatility of one market can affect the price fluctuations of related markets under certain conditions, with the degree of such influence being dynamically variable.
Despite these challenges, the academic community has devoted significant attention to certain shipping price indices, such as the China Containerized Freight Index (CCFI) and the Baltic Dry Index (BDI) [19]. Makridakis et al. [45] utilized Bayesian regression classifier (BRC) models to achieve accurate predictions of BDI. Zhang et al. [15] employed complete ensemble empirical mode decomposition (CEEMD) to decompose six major datasets from the Chinese shipping market and used a particle swarm optimization (PSO) enhanced bidirectional long short-term memory (BiLSTM) network for forecasting, efficiently capturing the non-stationary and nonlinear features of the data. Li [29] conducted research on the China Coastal Bulk Coal Freight Index (CBCFI) and established a combined ARMA-GM-GABP prediction model, which can effectively address the forecasting problem of CBCFI. This focus provides valuable insights for a deeper understanding of the fluctuations in the shipping market.
A review of past studies reveals several potential research gaps in container freight index prediction. First, compared to the BDI and the CCFI, the Shanghai Containerized Freight Index (SCFI) and the Ningbo Containerized Freight Index (NCFI) have not received equivalent levels of attention. Second, the existing literature predominantly focuses on historical data when studying container freight indices, overlooking other potential influencing factors, which results in a lack of comprehensiveness in forecasting these indices. Third, during the data preprocessing stage, many studies have not adequately prioritized the identification and correction of data outliers. Furthermore, traditional methods tend to decompose the entire dataset in one go before partitioning it into training and testing sets, thereby neglecting the risk of data leakage.
Considering the limitations of existing research, this paper proposes a novel forecasting framework for container freight indices that comprehensively considers multiple influencing factors. The primary contributions of this study are as follows:
(1)
This research achieves high-accuracy predictions for the Shanghai Containerized Freight Index (SCFI) and the Ningbo Containerized Freight Index (NCFI), filling a significant gap in this field. By investigating these two freight indices, we not only demonstrate the robustness and superior performance of the proposed real-time forecasting system but also provide strong decision-making support for shipping market managers, enhancing their capacity to withstand market risks.
(2)
In addition to traditional forecasting based on historical container freight data, this study incorporates various factors as input variables for the forecasting framework, including energy prices, commodity prices, Baidu index data, and indices of similar types. This results in the establishment of a multi-factor hybrid forecasting model, which significantly improves prediction performance.
(3)
A two-stage data preprocessing scheme is proposed. Initially, the Hampel filter is employed for outlier identification and correction. Subsequently, a real-time rolling data decomposition technique based on VMD is applied to further refine the preprocessed data, substantially enhancing the prediction accuracy of the system.
(4)
To optimize the performance and training efficiency of the XGBoost model, this paper introduces a novel intelligent optimization algorithm—the cheetah optimization algorithm (COA). COA is utilized to optimize key parameters of the XGBoost model, including feature sampling ratio (colsample_bytree), number of trees (n_estimators), learning rate (learning_rate), maximum tree depth (max_depth), and the proportion of rows sampled for each tree (subsample), thereby enhancing the overall predictive efficacy of the model.

2. Materials and Methods

This section outlines the pertinent algorithms and the research framework of the proposed hybrid forecasting model. Table 2 presents a list of acronyms, and Table 3 presents a list of notations.

2.1. Hampel Filter

The presence of outliers can negatively impact model training, leading to significant bias in prediction results [46]. The Hampel filter assumes that the given dataset follows a specific distribution and probability model, using an inconsistency test to process the data sequence accordingly [47]. The method has been applied in related research, such as CCFI [13]. The Hampel filter employs a robust moving estimate method, typically using a rolling median and the median absolute deviation (MAD) to identify outliers in time series data [48]. The Hampel principle and its mathematical formulas are as follows:
Given a time series X1, X2, …, Xn and a sliding window of length k, the median is calculated using Equation (1), and the standard deviation estimate is calculated using Equation (2):
m_i = \mathrm{median}\left( x_{i-k},\, x_{i-k+1},\, \ldots,\, x_i,\, \ldots,\, x_{i+k-1},\, x_{i+k} \right) \quad (1)
\sigma_i = A \cdot \mathrm{median}\left( \left| x_{i-k} - m_i \right|,\, \ldots,\, \left| x_{i+k} - m_i \right| \right) \quad (2)
A = \frac{1}{\sqrt{2}\,\mathrm{erf}^{-1}(1/2)} \approx 1.4826 \quad (3)
$m_i$ represents the median of a sliding window consisting of 2k + 1 points centered on the current point $x_i$, and $\sigma_i$ is the estimated standard deviation. The coefficient A, based on the inverse error function, is approximately 1.4826 for normally distributed data and converts the MAD to a scale comparable to the standard deviation. The Hampel filter uses $m_i$ and $\sigma_i$ to identify outliers: if the difference between a point $x_i$ and its corresponding median $m_i$ exceeds $3\sigma_i$, the point is classified as an outlier and replaced by the rolling median $m_i$. In this study, k = 5 is used; this window length effectively corrects outliers while avoiding distortion of the dataset [49].
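For concreteness, a minimal Python sketch of this filtering rule is given below, using the k = 5 window and the 3σ threshold described above; the function and argument names are illustrative and do not represent the authors' implementation.

```python
import numpy as np

def hampel_filter(x, k=5, n_sigma=3.0):
    """Minimal Hampel filter sketch: each point is compared with the median of the
    (2k + 1)-point window centred on it and replaced by that median when its
    deviation exceeds n_sigma robust standard deviations (Eqs. (1)-(3))."""
    x = np.asarray(x, dtype=float)
    cleaned = x.copy()
    A = 1.4826  # converts the MAD to a standard-deviation-comparable scale, Eq. (3)
    n = len(x)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)      # window is truncated at the edges
        window = x[lo:hi]
        m_i = np.median(window)                        # rolling median, Eq. (1)
        sigma_i = A * np.median(np.abs(window - m_i))  # robust standard deviation, Eq. (2)
        if np.abs(x[i] - m_i) > n_sigma * sigma_i:
            cleaned[i] = m_i                           # replace the outlier with the rolling median
    return cleaned
```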

2.2. Variational Mode Decomposition

Variational mode decomposition (VMD) is an advanced time-frequency analysis method known for its strong adaptability and high time-frequency resolution [50]. It demonstrates unique advantages in time series decomposition and has been widely applied in various fields, including the crude oil market [30], natural gas market [51], and financial stock market [52]. This method decomposes the original signal f(t) into intrinsic mode functions (IMFs), each with a distinct center frequency and bandwidth. A constraint ensures that each mode has a limited bandwidth with a specific center frequency component, and the total bandwidth of all modes is minimized as the objective function. The model requires the sum of all modes to equal the original signal.
The mode function (IMF) defines a single-component amplitude-modulated and frequency-modulated signal, expressed as
u_k(t) = A_k(t) \cos\left[ \varphi_k(t) \right] \quad (4)
$u_k(t)$ is the mode function, $A_k(t)$ is the envelope amplitude, representing the instantaneous amplitude of the mode function, and $\varphi_k(t)$ is the non-decreasing phase function.
Using the Hilbert transform and introducing an exponential term to blend the center frequencies of all modes, the model calculates the gradient square norm via Gaussian smoothing. The specific formulation of the model is as follows:
\min_{\{u_k\},\{\omega_k\}} \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j \omega_k t} \right\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k = f \quad (5)
In these equations, $\{u_k\}$ and $\{\omega_k\}$ represent the set of intrinsic mode functions obtained from the signal decomposition and their corresponding center frequencies, respectively. $\partial_t$ denotes the partial derivative with respect to t, δ(t) represents the Dirac delta function, j is the imaginary unit, and * denotes convolution. VMD is a non-recursive signal decomposition model. It suppresses the mode-aliasing effect observed in EMD and is theoretically well founded [52]. K is a pre-defined parameter that determines the number of extracted IMFs. The most common way to choose K is to use the EMD method to evaluate how many modes should be decomposed [30]. In this study, we have used K = 6.
The core idea of VMD is to use the Hilbert transform to obtain analytic signals for each mode component. By shifting the mode components’ spectra to their base frequencies through exponential mixing and estimating the bandwidth of mode components using Gaussian smoothing, VMD can effectively identify and extract signal components with different temporal characteristics from the complex, nonlinear, and unstable container freight index time series.
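A minimal sketch of the rolling (real-time) decomposition idea follows. It assumes the third-party vmdpy package and its VMD(f, alpha, tau, K, DC, init, tol) interface; the window length and the bandwidth penalty alpha are illustrative choices rather than the settings used in this study.

```python
import numpy as np
from vmdpy import VMD  # third-party VMD implementation (assumed available)

def rolling_vmd(series, window=200, K=6, alpha=2000, tau=0.0, tol=1e-7):
    """Rolling VMD sketch: at each step, decompose only the trailing `window`
    observations, so information from future time points never leaks into the
    features used for the current forecast."""
    series = np.asarray(series, dtype=float)
    imf_features = []
    for t in range(window, len(series) + 1):
        segment = series[t - window:t]                    # past data only
        u, _, _ = VMD(segment, alpha, tau, K, 0, 1, tol)  # DC = 0, init = 1
        imf_features.append(u[:, -1])                     # the K IMF values at the current time point
    return np.array(imf_features)                         # shape: (number of steps, K)
```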

2.3. Cheetah Optimization Algorithm

The cheetah optimization algorithm (COA) is a nature-inspired metaheuristic optimization algorithm proposed by Mohammad et al. [53] in 2022. This algorithm simulates the hunting mechanism of cheetahs, and its mathematical model includes three strategies: searching, waiting, and attacking. The parameters of the COA include the initial population size, number of iterations, and velocity update strategy. The optimization process is as follows.

2.3.1. Search Strategy

Cheetahs locate prey by scanning the environment or surrounding areas. During the hunt, cheetahs choose a search mode based on the prey’s condition, the area’s coverage, and their own state. The position update is expressed as follows
x_{i,j}^{t+1} = x_{i,j}^{t} + r_{i,j}^{-1} \cdot \alpha_{i,j}^{t} \quad (6)
where $x_{i,j}^{t+1}$ and $x_{i,j}^{t}$ represent the next and current positions of cheetah i in arrangement j, respectively. Index t denotes the current hunting time, and T is the maximum length of hunting time. $r_{i,j}^{-1}$ and $\alpha_{i,j}^{t}$ are the randomization parameter and step length for cheetah i in arrangement j, respectively. The randomization parameter is a normally distributed random number drawn from a standard normal distribution. The step length $\alpha_{i,j}^{t} > 0$ can in most cases be set to 0.001 × t/T, as cheetahs are slow-walking searchers.

2.3.2. Sit-and-Wait Strategy

After detecting prey, if the current location is not favorable for an immediate attack, the cheetah waits for a better opportunity. This is mathematically modeled as
x_{i,j}^{t+1} = x_{i,j}^{t} \quad (7)
This strategy increases the hunting success rate and helps avoid premature convergence.

2.3.3. Attack Strategy

This strategy involves two basic steps: (1) Chasing: When the cheetah decides to attack, it charges toward the prey at maximum speed. (2) Capturing: The cheetah uses its speed and agility to catch the prey by adjusting its position based on the prey’s escape direction and the positions of the leading or nearby cheetahs. This attack strategy is mathematically defined as
x_{i,j}^{t+1} = x_{B,j}^{t} + \theta_{i,j} \cdot \beta_{i,j}^{t} \quad (8)
In this equation, B represents the prey, and $x_{B,j}^{t}$ is the current position of the prey in arrangement j. $\theta_{i,j}$ and $\beta_{i,j}^{t}$ are the turning factor and interaction factor associated with cheetah i in arrangement j, respectively, where $\theta_{i,j} = \left| b_{i,j} \right|^{\exp(b_{i,j}/2)} \sin(2\pi b_{i,j})$ and $b_{i,j}$ is a normally distributed random number drawn from a standard normal distribution.
In COA, each cheetah’s performance is evaluated based on its fitness function value across all dimensions. Initially, the search strategy is predominantly used, while the use of the attack strategy increases as iterations proceed to find better solutions. To decide which strategy to employ, two random numbers, $r_2$ and $r_3$, uniformly distributed in the interval [0, 1], are introduced. If $r_2 \geq r_3$, the sit-and-wait strategy is chosen; otherwise, a third random value, $H = e^{2(1 - t/T)}(2r_1 - 1)$, determines whether to use the search or attack strategy, where $r_1$ is also a uniformly distributed random number in the interval [0, 1], t is the current iteration number, and T is the maximum number of iterations. By adjusting the value of $r_3$, the frequency of switching between the waiting strategy and the other two strategies can be controlled. If $H \geq r_4$, the attack mode is selected; otherwise, the search mode is chosen, where $r_4$ is a random number between 0 and 3.
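The strategy-selection rule can be expressed in a few lines of Python. The sketch below is illustrative only and implements just the switching logic described above, not the full position updates.

```python
import numpy as np

def choose_strategy(t, T, rng=None):
    """Return 'wait', 'attack', or 'search' for one cheetah at iteration t of T,
    following the r2/r3, H, and r4 rules described above (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    r2, r3 = rng.uniform(0, 1, size=2)
    if r2 >= r3:
        return "wait"                               # sit-and-wait strategy
    r1 = rng.uniform(0, 1)
    H = np.exp(2 * (1 - t / T)) * (2 * r1 - 1)      # switching factor H
    r4 = rng.uniform(0, 3)
    return "attack" if H >= r4 else "search"
```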

2.4. COA-XGBoost

XGBoost, developed by Tianqi Chen et al. [54], is an ensemble learning algorithm based on decision trees, widely applied in classification and regression tasks. It offers high training efficiency and scalability through multithreaded parallel computation. By integrating multiple weak learners, XGBoost builds a strong learner, forming a composite model with enhanced predictive accuracy. During the construction of the composite model, XGBoost uses the gradient boosting method, iteratively generating new trees to fit the residuals of previous trees until the model reaches an optimal state.
The base learner of XGBoost is the Classification and Regression Tree (CART). Given a dataset with n samples and m features, the final prediction output from K CART trees is expressed as follows
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \quad (9)
where $f_k$ is the decision function of the kth tree and $\mathcal{F}$ is the function space of all CART decision trees.
The objective function of XGBoost consists of two parts: the loss function and the regularization term. The general form of the objective function is as follows
L = \sum_{i} l\left( y_i, \hat{y}_i \right) + \sum_{k} \Omega(f_k) \quad (10)
where $l(y_i, \hat{y}_i)$ is the loss function, i.e., the error between the true and predicted values, and $\Omega(f_k)$ is the regularization term, which controls model complexity.
The expression for the regularization term Ω ( f k ) is
\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \quad (11)
where T represents the number of leaf nodes in the tree, while $\omega_j$ represents the weight assigned to the jth leaf node. The parameters γ and λ serve as penalty coefficients that promote smoother scores for each leaf node, thereby controlling the complexity of the tree and alleviating overfitting. The larger the values of γ and λ, the simpler the tree structure.
During model training, a gradient boosting strategy is used to add one new regression tree to the model at a time, and the predicted value of the model for the ith sample $x_i$ is
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \quad (12)
Substituting Equation (12) into Equation (10) yields the following equation:
L^{(t)} = \sum_{i=1}^{n} l\left( y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i) \right) + \Omega(f_t) + C \quad (13)
After applying a second-order Taylor expansion to the objective function, removing the constant term, and introducing the regularization term, the formula is expressed as (14)
L^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) \omega_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) \omega_j^2 \right] + \gamma T \quad (14)
In this equation, $g_i = \partial_{\hat{y}_i^{(t-1)}} l\left( y_i, \hat{y}_i^{(t-1)} \right)$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l\left( y_i, \hat{y}_i^{(t-1)} \right)$ are the first-order and second-order derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$.
Define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples assigned to leaf j. Minimizing the objective function then gives the optimal leaf weight $\omega_j^* = -\frac{G_j}{H_j + \lambda}$. Substituting the optimal solution, the final objective function is obtained as
L^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \quad (15)
where q denotes a tree structure consisting of t decision trees. The goal is to minimize the loss function while considering the regularization term γT, which controls the tree’s complexity to prevent overfitting.
XGBoost is renowned for its efficiency, accuracy, and versatility. However, due to its numerous model parameters, different combinations of these parameters can significantly impact the model’s performance and predictive accuracy [55]. To address the challenges of setting XGBoost parameters, this study introduces a novel meta-heuristic algorithm with fast convergence and strong optimization capabilities—the cheetah optimization algorithm (COA). COA is used to optimize key XGBoost parameters, including the number of decision trees, maximum tree depth, feature ratio for training each tree, learning rate, and sample ratio for tree training. The optimized parameters are then used as the final settings for the XGBoost algorithm. Figure 1 illustrates the workflow of the COA-XGBoost algorithm. The execution steps are as follows:
Step 1: Initializing the COA. Set the maximum number of iterations, population size, and boundary parameters for the cheetah search range to prepare for the iterative process of optimizing the solution space.
Step 2: Fitness evaluation. Calculate the fitness value for each cheetah (parameter group). The cheetah with the lowest fitness value in the population is selected as the global optimal solution. According to reference [55], the fitness function is defined as the mean squared error (MSE) of the XGBoost model, expressed as
f(x_i) = \frac{1}{n} \sum_{j=1}^{n} \left( \theta_{i,j} - \hat{\theta}_{i,j} \right)^2 \quad (16)
where $x_i$ represents the location of the ith cheetah and $\theta_{i,j}$ denotes the jth true value associated with the ith individual. The variable $\hat{\theta}_{i,j}$ is the prediction obtained by the XGBoost model based on the parameter settings of the ith cheetah.
Step 3: Updating the location. For each cheetah, update its position using three distinct strategies based on its current position relative to the global optimum.
Step 4: Updating the optimal solution. Identify the cheetah with the best (lowest) fitness value in the current population as the current optimal solution and record it for comparison in subsequent iterations.
Step 5: Check for stopping criteria. Confirm whether the maximum number of iterations has been reached or the desired fitness threshold has been met. If the stopping criteria are satisfied, the algorithm ends; otherwise, the process returns to Step 3 for further iteration.
Step 6: Output the results. The final optimized search results from the COA are input into the XGBoost model for prediction.
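Since the COA is not available as an off-the-shelf library routine, the sketch below illustrates only the fitness evaluation of Step 2: training XGBoost with one candidate parameter vector and returning its mean squared error. The search bounds are illustrative assumptions; any population-based optimizer (the COA in this study) can call this function when updating cheetah positions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Illustrative search bounds for the five tuned parameters (not the paper's exact ranges):
# [colsample_bytree, n_estimators, learning_rate, max_depth, subsample]
LOWER = np.array([0.5,  50, 0.01,  3, 0.5])
UPPER = np.array([1.0, 500, 0.30, 10, 1.0])

def fitness(position, X_train, y_train, X_val, y_val):
    """Fitness of one cheetah (Eq. (16)): the validation MSE of an XGBoost model
    trained with the hyperparameters encoded by `position`."""
    model = XGBRegressor(
        colsample_bytree=float(position[0]),
        n_estimators=int(round(position[1])),
        learning_rate=float(position[2]),
        max_depth=int(round(position[3])),
        subsample=float(position[4]),
        objective="reg:squarederror",
    )
    model.fit(X_train, y_train)
    return mean_squared_error(y_val, model.predict(X_val))
```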

2.5. Research Framework

This section elaborates on the hybrid forecasting framework designed for predicting container freight indices, detailing the various components of the forecasting system, including the outlier detection module, data decomposition module, and forecasting module. Figure 2 illustrates the architecture of the forecasting system.
Guided by the principles of multifactor influence and data preprocessing, a novel hybrid forecasting system for container freight indices has been developed.
First stage: outlier detection and data decomposition. In the first phase, a two-stage data preprocessing technique is applied to handle outliers and decompose the container freight index time series data. The Hampel filter, a statistical method specifically designed to detect and remove outliers caused by measurement errors or system failures, is employed to preprocess the time series data. Following Hampel filtering, the data is subjected to real-time rolling decomposition using the VMD. VMD is an effective nonlinear time series processing method that identifies and extracts signal components with different time-domain characteristics from the complex and nonlinear container freight index time series.
Second stage: feature selection for external factors and historical data. In the second phase, the study selects external factor features and historical container freight index data. The complexity of fluctuations in the container freight index results from the interplay of various external factors and the historical data itself. Therefore, this study selects influencing factors from energy prices, commodity prices, similar shipping price indices, and Baidu index data, among others. Moreover, considering the impact of historical data on future prices, the Partial Autocorrelation Function (PACF) is used to determine the lag order of the container shipping price index time series, with the maximum lag value being 3.
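As an illustration of this lag-selection step, the sketch below computes the PACF with statsmodels and keeps the lags whose partial autocorrelation exceeds the approximate 95% confidence band; this band-based cutoff is a common rule of thumb and is assumed here rather than taken from the paper, which reports a maximum lag of 3.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

def significant_lags(series, max_lag=10):
    """Return the lags whose partial autocorrelation lies outside the
    approximate 95% confidence band +/- 1.96 / sqrt(n)."""
    series = np.asarray(series, dtype=float)
    values = pacf(series, nlags=max_lag)      # values[0] is lag 0 and is ignored
    band = 1.96 / np.sqrt(len(series))
    return [lag for lag in range(1, max_lag + 1) if abs(values[lag]) > band]
```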
Third stage: COA-XGBoost model construction and forecasting. The third phase involves constructing the COA-XGBoost model and implementing the forecast. Deep learning algorithms exhibit significant advantages in handling high-dimensional data and uncovering complex underlying features. This study utilizes the COA to optimize the XGBoost algorithm for building the container freight index forecasting model. COA, a metaheuristic algorithm known for its fast convergence speed and strong optimization capability, effectively addresses the challenge of parameter tuning in deep learning algorithms and has not yet been applied in container freight index forecasting. Given XGBoost’s superior learning capacity and predictive accuracy, this study selects the COA-optimized XGBoost model as the foundation for container freight index forecasting.
Fourth stage: model performance evaluation. In the fourth phase, multiple accuracy metrics and statistical testing methods, detailed in Section 3.3, are used to comprehensively evaluate the performance of the constructed model. These metrics allow for an objective assessment of the model’s effectiveness and validate its applicability to container freight index forecasting.

3. Empirical Research

To validate the superiority and accuracy of the proposed hybrid prediction system, this study selected the SCFI and NCFI indices as the empirical research subjects. Two sets of experiments were conducted: the first set aimed to assess the performance of deep learning algorithms, while the second set focused on evaluating the effectiveness of decomposition algorithms and the overall performance of the proposed model.

3.1. Data Description

This study focuses on the container freight index data for the Chinese region. SCFI reflects the spot market rates for container shipments from the Port of Shanghai to major global destinations [56]. The SCFI is not only significant for the Port of Shanghai but also serves as an indicator of export trade for the entire Chinese and broader Asian regions. Additionally, the NCFI specifically tracks the export container freight rates from the Port of Ningbo [19]. As Ningbo is another critical hub for international trade in China, the NCFI holds considerable reference value as well. The Shanghai Containerized Freight Index (SCFI) is sourced from the Shanghai Shipping Exchange, and the Ningbo Containerized Freight Index (NCFI) is obtained from the Ningbo Shipping Exchange. All data used in this study were collected through the Choice Financial Terminal database. Due to the reliance on third-party data provided by Choice, there exists a certain degree of time lag in data availability. The datasets used in this study cover the period from 16 November 2012 to 24 May 2024, for the SCFI, and from 26 October 2012 to 7 June 2024, for the NCFI. Of these data, 80% were used for model training and construction, while the remaining 20% were reserved for model validation and evaluation. The training set selected for this study encompasses a complete price cycle, including phases of price increase, decrease, and fluctuation [57]. The time series of the container freight index exhibits pronounced nonlinear and unstable characteristics, which result from the interplay of various external factors. Therefore, considering the multiple factors influencing the container freight index aids in enhancing the accuracy of predictions. Accordingly, we identified predictive variables across five domains: energy pricing, commodity markets, analogous shipping indices, Baidu search index trends, and historical freight rates. A multidimensional predictive framework was constructed to comprehensively capture the variations in the container freight index.

3.2. Data Normalization

The collected influencing factors vary in magnitude and units. To ensure the comparability of the data, linear normalization is applied, converting the data to the same scale or unit. This facilitates comparison and analysis while enhancing the accuracy and performance of machine learning algorithms. The normalization expression is as follows
x^* = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (17)
where $x^*$ represents the normalized value, x is the original data, $x_{\min}$ is the minimum value, and $x_{\max}$ is the maximum value in the dataset. After normalization, the data fall within the range [0, 1].
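A minimal sketch of this normalization, together with the inverse mapping used when evaluating predictions on the original scale, is given below; in practice the minimum and maximum would be taken from the training set only, to avoid leaking information from the test period.

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization of Equation (17); returns values scaled to [0, 1]
    along with the minimum and maximum needed to invert the transform."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def denormalize(x_star, x_min, x_max):
    """Map normalized values back to the original scale."""
    return x_star * (x_max - x_min) + x_min
```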

3.3. Evaluation Metrics

To evaluate the performance of the proposed hybrid prediction system, four typical error evaluation metrics were applied in this study. The root mean square error (RMSE) is the square root of the average of the squared differences between the predicted values and the true values. The mean absolute error (MAE) is the average of the absolute differences between the predicted values and the true values. The mean absolute percentage error (MAPE) is the average of the absolute percentage differences between the predicted values and the true values. The Theil’s inequality coefficient (TIC) measures the relative error between the predicted sequence and the true sequence. Smaller metric values consistently indicate superior prediction performance across all measures. The formulas for each evaluation metric are shown in Table 4.
Here, n represents the number of samples, and $x_i$ and $f(x_i)$ represent the true value and predicted value of point i, respectively.
Additionally, the improvement percentages for RMSE, MAE, MAPE, and TIC were calculated. This was to observe the effectiveness of the proposed model in a more intuitive manner. The improvement evaluation indicators were calculated as follows:
P_{\text{Index}} = \frac{\text{Index}_{\text{model1}} - \text{Index}_{\text{model2}}}{\text{Index}_{\text{model1}}} \times 100\% \quad (18)
$\text{Index}_{\text{model1}}$ denotes the RMSE, MAE, MAPE, or TIC of the benchmark model, while $\text{Index}_{\text{model2}}$ denotes the corresponding indicator value of the comparison model.
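The sketch below computes the four accuracy metrics and the improvement percentage of Equation (18); the TIC is written in its common Theil's-U form, which is an assumption about the exact expression given in Table 4.

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE, and TIC for one forecast series (TIC in Theil's-U form)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    mae = np.mean(np.abs(y_pred - y_true))
    mape = np.mean(np.abs((y_pred - y_true) / y_true))
    tic = rmse / (np.sqrt(np.mean(y_true ** 2)) + np.sqrt(np.mean(y_pred ** 2)))
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "TIC": tic}

def improvement(index_model1, index_model2):
    """Percentage improvement of the comparison model over the benchmark, Eq. (18)."""
    return (index_model1 - index_model2) / index_model1 * 100.0
```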

3.4. Experimental Setup

In this study, the selected models include SVR (Model 1), RF (Model 2), and XGBoost (Model 3), which are individual prediction models that do not incorporate decomposition techniques or optimization algorithms. CEEMDAN-COA-XGBoost (Model 4) and VMD-COA-XGBoost (Model 5) integrate the cheetah optimization algorithm (COA) and decomposition algorithms but do not consider outlier handling techniques. Hampel-CEEMDAN-COA-XGBoost (Model 6) and Hampel-VMD-COA-XGBoost (Model 7) further combine the Hampel filtering algorithm for outlier treatment. All models account for historical container shipping prices and other relevant factors. The characteristics of different models are shown in Table 5.
To validate the effectiveness, general applicability, and superiority of the proposed system, two experiments were designed and conducted. In Experiment 1, a comparative analysis was performed to evaluate the predictive performance of the three individual models—SVR, RF, and XGBoost—on two sets of shipping price index data to assess their strengths and weaknesses. To ensure a fair comparison, the baseline models, including SVR, RF, and XGBoost, were implemented by directly invoking the open-source machine learning library Scikit-learn in Python 3.7. The hyperparameters for all models were set to their default values. In Experiment 2, based on the XGBoost model, we introduced optimization algorithms, decomposition algorithms, and outlier handling techniques to enhance the model’s predictive performance. Through the analysis of Models 4 and 5, we verified the contribution of optimization and decomposition algorithms in building the prediction models. To further clarify the superiority of the proposed Hampel-VMD-COA-XGBoost (model 7) prediction system, we conducted a comparative analysis of Models 4, 5, 6, and 7 to demonstrate the gain effect of outlier handling techniques.
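For reference, a sketch of the Experiment 1 baseline setup is shown below. SVR and RF come from Scikit-learn, while XGBoost is accessed through the xgboost package's Scikit-learn-compatible wrapper; all models are kept at their default hyperparameters as described above.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def fit_baselines(X_train, y_train, X_test):
    """Fit the three Experiment 1 baselines with default hyperparameters
    and return their test-set predictions."""
    baselines = {
        "SVR": SVR(),                                            # Model 1
        "RF": RandomForestRegressor(),                           # Model 2
        "XGBoost": XGBRegressor(objective="reg:squarederror"),   # Model 3
    }
    predictions = {}
    for name, model in baselines.items():
        model.fit(X_train, y_train)
        predictions[name] = model.predict(X_test)
    return predictions
```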

3.5. Experiment I

In Experiment 1, a comparative analysis was designed for each shipping index. The objective was to identify the optimal baseline model by comparing the performance of three models—support vector regression (SVR), random forest (RF), and XGBoost—on the SCFI and NCFI datasets. Figure 3 presents the prediction results of the three models for the two datasets, while Table 6 lists the evaluation metrics for each model. By comparing the predictive performance of the individual models, it is evident that the XGBoost model offers the greatest predictive advantage.
For instance, in Table 6 and Figure 3a, on the SCFI dataset, the RMSE values for SVR, RF, and XGBoost are 68.8425, 50.4306, and 36.7491, respectively. Similarly, XGBoost outperforms the other models across all evaluation metrics, including MAE, MAPE, and TIC, with lower values. This demonstrates that the XGBoost model excels at capturing temporal information embedded in the time series data.

3.6. Experiment II

In Experiment 2, the predictive performances of the CEEMDAN-COA-XGBoost (Model 4) and VMD-COA-XGBoost (Model 5) models were compared to explore the impact of different decomposition algorithms on model prediction capabilities. Figure 4 illustrates the prediction outcomes of Models 4 and 5 for both datasets, while Table 7 lists the evaluation metrics for these two models. The results reveal that the VMD decomposition algorithm exhibits a notable advantage in constructing predictive models. For instance, as clearly depicted in Figure 4b and Table 7 for the Ningbo dataset, Model 5 outperforms Model 4. Specifically, Model 5 achieves lower values across all four evaluation metrics—RMSE, MAE, MAPE, and TIC—with respective scores of 8.3489, 5.7984, 0.0043, and 0.0023, indicating its superior performance over Model 4. These findings suggest that VMD can more effectively extract useful information from time series data, thereby enhancing the performance of predictive models.
Furthermore, to evaluate the applicability of the Hampel filter, the Hampel-CEEMDAN-COA-XGBoost (Model 6) and Hampel-VMD-COA-XGBoost (Model 7) models were designed and their predictions compared. Figure 5 presents the prediction outcomes of these two models for both datasets (“H-real” denoting the corrected actual values), and Table 7 provides the corresponding evaluation metrics. By comparing the performances of Model 4 with Model 6, and Model 5 with Model 7, the contribution of the Hampel filter to the proposed forecasting system was further validated. The experimental results demonstrate a significant enhancement in prediction accuracy following the introduction of the Hampel filter, underscoring the importance of data preprocessing, particularly outlier handling, in improving forecast precision.

3.7. Feature Importance Analysis

Feature importance analysis can aid government decision-making in various areas, such as gaining insights into energy markets, optimizing policy design, and formulating cross-sector policies, by identifying factors that significantly influence the target prediction values [58]. In this study, the feature importance analysis is conducted using the built-in feature importance attribute of the XGBoost model. Table 8 shows the ranking results of feature importance.
It is observed that, firstly, historical data serves as the most important source of prediction for both container freight indices. The importance of historical data in the SCFI and NCFI datasets reaches 0.2133 and 0.2229, respectively. Compared to other features, historical freight data dominates the analysis. This is due to the fact that the freight rate time series encapsulates the inherent complexities of the market, containing vital information regarding past performance and rate fluctuations. Such data enables researchers to analyze and predict future price trends. Moreover, the time series data of container freight indices exhibit autocorrelation, which explains why most prior studies on freight rate forecasting have predominantly utilized historical freight sequences, affirming that historical freight data is an indispensable factor in predicting container freight rates. Secondly, aside from historical data, the significance of analogous shipping price indices in the SCFI and NCFI datasets ranks highest at 0.4373 and 0.3185, respectively. This suggests a long-term equilibrium and mutual guiding relationship among the various shipping indices. Thirdly, energy factors, commodity prices, and market sentiment also contribute to the prediction targets. This is attributable to the indirect connections these factors have with container freight rates. During periods of economic prosperity and optimistic market sentiment, container freight rates tend to fluctuate. Additionally, an increase in fossil fuel prices raises the operational costs for transport equipment relying on these energy sources, subsequently leading to an increase in the container freight index.
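For reference, a minimal sketch of reading such a ranking from a fitted model via XGBoost's built-in feature_importances_ attribute is shown below; the importance type reflected by this attribute (gain, weight, or cover) depends on the model configuration.

```python
from xgboost import XGBRegressor

def ranked_feature_importance(model: XGBRegressor, feature_names):
    """Return (feature, importance) pairs sorted from most to least important,
    using XGBoost's built-in feature_importances_ attribute."""
    scores = model.feature_importances_
    return sorted(zip(feature_names, scores), key=lambda item: item[1], reverse=True)
```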

4. Discussion

4.1. Improvement Percentages

In this section, we discuss the percentage improvements in evaluation metrics across different models in the two experimental datasets to demonstrate the advantages of the proposed prediction model and draw some important insights. The improvement percentages are presented in Table 9, offering a comparative analysis of the advantages of the proposed real-time hybrid prediction model. The discussion follows:
(1)
Across all evaluation metrics, the proposed Hampel-VMD-COA-XGBoost prediction model outperforms the other models. For example, in the SCFI dataset, according to the results presented in Table 7, the RMSE, MAE, MAPE, and TIC of the proposed prediction framework are 3.3268, 2.1778, 0.0009, and 0.0007, respectively.
(2)
In both datasets, the predictive performance of the XGBoost model surpasses that of SVR and RF models. This demonstrates the superior ability of the XGBoost model to leverage past and future time series information for accurate predictions. It also confirms XGBoost’s advantage in extracting hidden information from time series data.
(3)
A comparative analysis between the standalone XGBoost model and XGBoost models incorporating decomposition modules and optimization algorithms reveals a significant improvement in accuracy when the COA algorithm and decomposition modules are applied. For example, in the NCFI dataset, as shown in Table 9, when comparing XGBoost with CEEMDAN-COA-XGBoost and XGBoost with VMD-COA-XGBoost, the models incorporating decomposition and optimization techniques show notable improvements in all evaluation metrics. This highlights the contributions of decomposition techniques and optimization algorithms to the prediction results. Furthermore, when comparing CEEMDAN-COA-XGBoost and VMD-COA-XGBoost, the VMD-COA-XGBoost model shows improvements of 25.42%, 30.79%, 35.82%, and 25.81% across the evaluation metrics. Therefore, VMD is considered more suitable for handling such nonlinear time series data, aiding in improving both the accuracy and stability of the predictive model.
(4)
CEEMDAN-COA-XGBoost was compared with Hampel-CEEMDAN-COA-XGBoost and VMD-COA-XGBoost was compared with Hampel-VMD-COA-XGBoost. For the SCFI dataset, as shown in Table 9, the prediction systems incorporating Hampel outlier detection achieved significantly improved prediction performance. This further validates the effectiveness of the Hampel filtering algorithm. Additionally, compared to the Hampel-CEEMDAN-COA-XGBoost model, the proposed prediction system showed improvements of 36.96%, 44.82%, 62.50%, and 41.67% across various evaluation metrics, indicating that the proposed system outperforms other comparative models.
From this comparative analysis, the developed hybrid prediction model significantly outperforms both baseline and other hybrid models in both datasets. Compared to standalone models and other hybrid models, the designed prediction framework, based on multi-factor influence, two-stage data preprocessing, and real-time rolling predictions, fully captures and learns the relationships between key variables and future container freight rates, achieving successful predictive results.

4.2. Comparison of the Proposed Model and Existing Model

In recent years, research on container freight indices has increased significantly. To demonstrate the effectiveness of the proposed model, this study compares it with existing hybrid models for predicting other container freight indices. Using the SCFI dataset as a case study, the comparison of models is introduced as follows:
Hirata et al. [56] introduced the LSTM deep learning model into the realm of SCFI prediction, empirically demonstrating its advantages over traditional time series models in enhancing prediction accuracy and comprehending complex market dynamics. This innovative approach has significantly advanced the field of container freight index forecasting. By comparing the proposed model with this benchmark, we can evaluate its performance in predicting container freight indices. The proposed model and the comparative models are assessed using the RMSE evaluation metric, with results presented in Table 10. Additionally, to ensure rigorous comparison, the forecasting timeframes were standardized.
From Table 10, it is evident that in the prediction comparisons for SCFI, the RMSE of the proposed model outperforms that of the comparative models. The analysis reveals that the proposed model exhibits superior performance when compared to other models.

4.3. Limitations of the Present Study and Future Work

This study presents a container freight rate forecasting method based on multi-factor influence and data preprocessing, thereby enhancing the accuracy of freight rate predictions and providing better decision-making support for governments, businesses, and investors. Despite its strong predictive performance, the study has certain limitations.
From the model perspective: (1) The selection of parameter combinations for heuristic optimization algorithms is challenging. Researchers often need to invest significant time and computational resources in finding suitable parameter settings, which may reduce the model’s efficiency and effectiveness. (2) The integration of two-stage data preprocessing within the hybrid model increases operational complexity, leading to longer processing times. This could pose a bottleneck in practical applications, especially in scenarios requiring rapid responses.
Regarding the study of container freight rates: (1) The indicator system developed in this study requires further refinement. Future research should consider incorporating factors such as national policies, exchange rates, and container turnover rates to enhance the model’s comprehensiveness and practicality. (2) The study focuses only on short-term predictions of container freight rates, without conducting a comprehensive analysis of long-term price trends. This limits the model’s applicability in long-term forecasting scenarios.
Therefore, future work should focus on improving the simplicity and efficiency of optimization algorithms and data processing techniques to develop a more intelligent and user-friendly prediction model. Additionally, the indicator system should be expanded by including other influential factors, such as container turnover rates and quantifiable policy variables, to create a more comprehensive and robust framework.

5. Policy Implications and Conclusions

5.1. Policy Recommendations

Based on the practical significance of this study, the following policy recommendations are suggested:
First, emphasize the importance of historical container freight rates. The feature importance analysis of the container freight index indicates that historical rates are among the key indicators for freight rate forecasting. Understanding their trends and patterns can provide valuable insights for predicting future container freight rates. The government should establish a comprehensive shipping price database, including historical container freight rates and container transaction volumes from various shipping markets, to enable in-depth analysis by researchers.
Second, proactively address energy price fluctuations. Energy price volatility directly impacts production costs and energy choices for companies, exerting a complex influence on container shipping rates. Since maritime transport still heavily relies on oil and coal, fluctuations in oil and coal prices significantly affect shipping costs. When these prices rise, the government should implement measures to alleviate the cost pressures on enterprises. Specific actions should include encouraging companies to enhance energy efficiency and upgrade technologies, while providing financial support, low-interest loans, or other incentives to help companies optimize their energy usage. At the same time, advancing the development of clean energy sources is critical. By increasing the supply of clean energy, dependence on traditional fossil fuels can be reduced, thereby mitigating the impact of energy price fluctuations on business costs.
Finally, monitor changes in market interdependence and spillover effects. Continuous attention should be given to the influence of factors such as financial markets and commodities on container freight rates. Through data analysis, including news reports and social media discussions, governments can better understand market sentiment and public perceptions of the container transport industry. This information will enable policymakers to develop more informed policies to address various market changes and emerging trends, while formulating appropriate strategies in response to evolving conditions.

5.2. Conclusions

This study proposes a hybrid prediction system for container shipping prices that integrates multiple factors within a decomposition-based framework. By combining two-stage data preprocessing techniques with the cheetah optimization algorithm (COA), the Hampel-VMD-COA-XGBoost ensemble learning model was ultimately established. Real-time rolling forecasts were conducted on two container shipping price datasets (SCFI and NCFI), achieving accurate predictions. Empirical analyses on the SCFI and NCFI datasets were performed, where four evaluation metrics were introduced to assess the prediction accuracy and stability of different models across various datasets. The results demonstrated that the developed method outperformed six other models in terms of prediction performance. The key conclusions from the empirical analysis are as follows:
(1)
Across both the SCFI and NCFI datasets, the proposed prediction framework consistently outperformed all comparative models, underscoring its superiority and generalizability in predicting container shipping price indices.
(2)
Integrating multiple external factors yielded richer predictive insights, resulting in more reliable outcomes. This highlights the importance of multi-factor analysis in enhancing the accuracy and stability of prediction models.
(3)
Effective data preprocessing mechanisms can mitigate the impact of data noise, enabling the model to better capture the underlying data characteristics, thereby improving the predictive performance of the system.
(4)
The use of the cheetah optimization algorithm, with its exceptional optimization capabilities, allowed the XGBoost model to fully leverage its potential, leading to superior prediction results.
(5)
A hybrid prediction system for container shipping prices was established, with XGBoost—known for its strong generalization capabilities—serving as the predictor and integration tool. Ultimately, the system successfully captured deep feature information from container shipping prices and achieved nonlinear integration of the prediction results.

Author Contributions

Methodology, Y.D. and X.Z.; Validation, Y.D. and X.Z.; Investigation, X.W.; Resources, Y.F. and K.L.; Data curation, X.Z.; Writing—original draft, X.Z.; Writing—review and editing, Y.F. and K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Henan Provincial Natural Science Foundation (Grant No. 242300421257).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Yin, J.; Shi, J. Seasonality patterns in the container shipping freight rate market. Marit. Policy Manag. 2018, 45, 159–173. [Google Scholar] [CrossRef]
  2. Saeed, N.; Nguyen, S.; Cullinane, K.; Gekara, V.; Chhetri, P. Forecasting container freight rates using the Prophet forecasting method. Transp. Policy 2023, 133, 86–107. [Google Scholar] [CrossRef]
  3. Munim, Z.H.; Schramm, H.-J. Forecasting container shipping freight rates for the Far East—Northern Europe trade lane. Marit. Econ. Logist. 2016, 19, 106–125. [Google Scholar] [CrossRef]
  4. Hao, J.; Yuan, J.; Wu, D.; Xu, W.; Li, J. A dynamic ensemble approach for multi-step price prediction: Empirical evidence from crude oil and shipping market. Expert Syst. Appl. 2023, 234, 121117. [Google Scholar] [CrossRef]
  5. Wang, S.; Wei, F.; Li, H.; Wang, Z.; Wei, P. Comparison of SARIMA model and Holt-Winters model in predicting the incidence of Sjögren’s syndrome. Int. J. Rheum. Dis. 2022, 25, 1263–1269. [Google Scholar] [CrossRef]
  6. Zhao, H.-M.; He, H.-D.; Lu, K.-F.; Han, X.-L.; Ding, Y.; Peng, Z.-R. Measuring the impact of an exogenous factor: An exponential smoothing model of the response of shipping to COVID-19. Transp. Policy 2022, 118, 91–100. [Google Scholar] [CrossRef]
  7. Liu, Z.; Huang, S. Carbon option price forecasting based on modified fractional Brownian motion optimized by GARCH model in carbon emission trading. North Am. J. Econ. Financ. 2021, 55, 101307. [Google Scholar] [CrossRef]
  8. Koyuncu, K.; Tavacioğlu, L.; Gökmen, N.; Arican, U.Ç. Forecasting COVID-19 impact on RWI/ISL container throughput index by using SARIMA models. Marit. Policy Manag. 2021, 48, 1096–1108. [Google Scholar] [CrossRef]
  9. Liu, J.; Li, Z.; Sun, H.; Yu, L.; Gao, W. Volatility forecasting for the shipping market indexes: An AR-SVR-GARCH approach. Marit. Policy Manag. 2021, 49, 864–881. [Google Scholar] [CrossRef]
  10. Bildirici, M.; Şahin Onat, I.; Ersin, Ö.Ö. Forecasting BDI Sea Freight Shipment Cost, VIX Investor Sentiment and MSCI Global Stock Market Indicator Indices: LSTAR-GARCH and LSTAR-APGARCH Models. Mathematics 2023, 11, 1242. [Google Scholar] [CrossRef]
  11. Liao, H.; Zeng, J.; Wu, C. A model for online forum traffic prediction integrated with multiple models. Comput. Eng. 2020, 46, 62–66. [Google Scholar]
  12. Liu, J.; Chu, N.; Wang, P.; Zhou, L.; Chen, H. A novel hybrid model for freight volume prediction based on the Baidu search index and emergency. Neural Comput. Appl. 2023, 36, 1313–1328. [Google Scholar] [CrossRef]
  13. Yin, K.; Guo, H.; Yang, W. A novel real-time multi-step forecasting system with a three-stage data preprocessing strategy for containerized freight market. Expert Syst. Appl. 2024, 246, 123141. [Google Scholar] [CrossRef]
  14. Han, Q.; Yan, B.; Ning, G.; Yu, B. Forecasting Dry Bulk Freight Index with Improved SVM. Math. Probl. Eng. 2014, 2014, 460684. [Google Scholar] [CrossRef]
  15. Zhang, Q.; Li, C.; Wang, X.; Hu, Y.; Yan, Y.; Jin, H.; Shang, G. Forecasting shipping index using CEEMD-PSO-BiLSTM model. PLoS ONE 2023, 18, e0280504. [Google Scholar] [CrossRef]
  16. Tsioumas, V.; Papadimitriou, S.; Smirlis, Y.; Zahran, S.Z. A Novel Approach to Forecasting the Bulk Freight Market. Asian J. Shipp. Logist. 2017, 33, 33–41. [Google Scholar] [CrossRef]
  17. Du, P.; Wang, J.; Yang, W.; Niu, T. Container throughput forecasting using a novel hybrid learning method with error correction strategy. Knowl.-Based Syst. 2019, 182, 104853. [Google Scholar] [CrossRef]
  18. Xie, G.; Zhang, N.; Wang, S. Data characteristic analysis and model selection for container throughput forecasting within a decomposition-ensemble methodology. Transp. Res. Part E Logist. Transp. Rev. 2017, 108, 160–178. [Google Scholar] [CrossRef]
  19. Schramm, H.-J.; Munim, Z.H. Container freight rate forecasting with improved accuracy by integrating soft facts from practitioners. Res. Transp. Bus. Manag. 2021, 41, 100662. [Google Scholar] [CrossRef]
  20. Tu, X.; Yang, Y.; Lin, Y.; Ma, S. Analysis of influencing factors and prediction of China’s Containerized Freight Index. Front. Mar. Sci. 2023, 10, 1245542. [Google Scholar] [CrossRef]
  21. Bae, S.-H.; Lee, G.; Park, K.-S. A Baltic Dry Index Prediction using Deep Learning Models. J. Korea Trade 2021, 25, 17–36. [Google Scholar] [CrossRef]
  22. Ghareeb, A. Time Series Forecasting of Stock Price for Maritime Shipping Company in COVID-19 Period Using Multi-Step Long Short-Term Memory (LSTM) Networks. Proc. Int. Conf. Bus. Excell. 2023, 17, 1728–1747. [Google Scholar] [CrossRef]
  23. Xiao, W.; Xu, C.; Liu, H.; Liu, X.; Kim, D.-K. A Hybrid LSTM-Based Ensemble Learning Approach for China Coastal Bulk Coal Freight Index Prediction. J. Adv. Transp. 2021, 2021, 5573650. [Google Scholar] [CrossRef]
  24. Zhang, X.; Xue, T.; Eugene Stanley, H. Comparison of Econometric Models and Artificial Neural Networks Algorithms for the Prediction of Baltic Dry Index. IEEE Access 2019, 7, 1647–1657. [Google Scholar] [CrossRef]
  25. Katris, C.; Kavussanos, M.G. Time series forecasting methods for the Baltic dry index. J. Forecast. 2021, 40, 1540–1565. [Google Scholar] [CrossRef]
  26. Shih, Y.-C.; Lin, M.-S.; Lirn, T.-C.; Juang, J.-G. A new-type deep learning model based on Shapley regulation for containerized freight index prediction. J. Mar. Sci. Technol. 2024, 32, 8. [Google Scholar] [CrossRef]
  27. Liu, S.; Huang, J.; Xu, L.; Zhao, X.; Li, X.; Cao, L.; Wen, B.; Huang, Y. Combined model for prediction of air temperature in poultry house for lion-head goose breeding based on PCA-SVR-ARMA. Trans. Chin. Soc. Agric. Eng. 2020, 36, 225–233. [Google Scholar]
  28. Kamal, I.M.; Bae, H.; Sunghyun, S.; Yun, H. DERN: Deep Ensemble Learning Model for Short- and Long-Term Prediction of Baltic Dry Index. Appl. Sci. 2020, 10, 1504. [Google Scholar] [CrossRef]
  29. Li, Z.; Piao, W.; Wang, L.; Wang, X.; Fu, R.; Fang, Y. China Coastal Bulk (Coal) Freight Index Forecasting Based on an Integrated Model Combining ARMA, GM and BP Model Optimized by GA. Electronics 2022, 11, 2732. [Google Scholar] [CrossRef]
  30. Huang, Y.; Deng, Y. A new crude oil price forecasting model based on variational mode decomposition. Knowl.-Based Syst. 2021, 213, 106669. [Google Scholar] [CrossRef]
  31. Zeng, Q.; Qu, C.; Ng, A.K.Y.; Zhao, X. A new approach for Baltic Dry Index forecasting based on empirical mode decomposition and neural networks. Marit. Econ. Logist. 2015, 18, 192–210. [Google Scholar] [CrossRef]
  32. Chen, Y.; Liu, B.; Wang, T. Analysing and forecasting China containerized freight index with a hybrid decomposition–ensemble method based on EMD, grey wave and ARMA. Grey Syst. Theory Appl. 2020, 11, 358–371. [Google Scholar] [CrossRef]
  33. Bagherzadeh, S.A.; Sabzehparvar, M. A local and online sifting process for the empirical mode decomposition and its application in aircraft damage detection. Mech. Syst. Signal Process. 2015, 54–55, 68–83. [Google Scholar] [CrossRef]
  34. Bagherzadeh, S.A.; Asadi, D. Detection of the ice assertion on aircraft using empirical mode decomposition enhanced by multi-objective optimization. Mech. Syst. Signal Process. 2017, 88, 9–24. [Google Scholar] [CrossRef]
  35. Zhang, C.; Zhao, Y.; Zhao, H. A Novel Hybrid Price Prediction Model for Multimodal Carbon Emission Trading Market Based on CEEMDAN Algorithm and Window-Based XGBoost Approach. Mathematics 2022, 10, 4072. [Google Scholar] [CrossRef]
  36. Chou, C.-C.; Lin, K.-S. A fuzzy neural network combined with technical indicators and its application to Baltic Dry Index forecasting. J. Mar. Eng. Technol. 2018, 18, 82–91. [Google Scholar] [CrossRef]
  37. Sahin, B.; Gurgen, S.; Unver, B.; Altin, I. Forecasting the Baltic Dry Index by using an artificial neural network approach. Turk. J. Electr. Eng. Comput. Sci. 2018, 26, 1673–1684. [Google Scholar] [CrossRef]
  38. Gu, B.; Liu, J. Determinants of dry bulk shipping freight rates: Considering Chinese manufacturing industry and economic policy uncertainty. Transp. Policy 2022, 129, 66–77. [Google Scholar] [CrossRef]
  39. Jeon, J.-W.; Duru, O.; Munim, Z.H.; Saeed, N. System Dynamics in the Predictive Analytics of Container Freight Rates. Transp. Sci. 2021, 55, 946–967. [Google Scholar] [CrossRef]
  40. Inglada-Pérez, L.; Coto-Millán, P. A Chaos Analysis of the Dry Bulk Shipping Market. Mathematics 2021, 9, 2065. [Google Scholar] [CrossRef]
  41. Chou, C.C.; Lin, K.S. A Fuzzy Neural Network Model for Analysing Baltic Dry Index in the Bulk Maritime Industry. Int. J. Marit. Eng. 2017, 159, A2. [Google Scholar] [CrossRef]
  42. Yang, Z.; Mehmed, E.E. Artificial neural networks in freight rate forecasting. Marit. Econ. Logist. 2019, 21, 390–414. [Google Scholar] [CrossRef]
  43. Tsioumas, V.; Papadimitriou, S. The dynamic relationship between freight markets and commodity prices revealed. Marit. Econ. Logist. 2016, 20, 267–279. [Google Scholar] [CrossRef]
  44. Tsouknidis, D.A. Dynamic volatility spillovers across shipping freight markets. Transp. Res. Part E Logist. Transp. Rev. 2016, 91, 90–111. [Google Scholar] [CrossRef]
  45. Makridakis, S.; Merikas, A.; Merika, A.; Tsionas, M.G.; Izzeldin, M. A novel forecasting model for the Baltic dry index utilizing optimal squeezing. J. Forecast. 2019, 39, 56–68. [Google Scholar] [CrossRef]
  46. Shabbir, M.; Chand, S.; Iqbal, F.; Kisi, O. Hybrid Approach for Streamflow Prediction: LASSO-Hampel Filter Integration with Support Vector Machines, Artificial Neural Networks, and Autoregressive Distributed Lag Models. Water Resour. Manag. 2024, 38, 4179–4196. [Google Scholar] [CrossRef]
  47. Park, C.-H.; Chang, J.-H. WLS Localization Using Skipped Filter, Hampel Filter, Bootstrapping and Gaussian Mixture EM in LOS/NLOS Conditions. IEEE Access 2019, 7, 35919–35928. [Google Scholar] [CrossRef]
  48. Yin, S.; Liu, H. Wind power prediction based on outlier correction, ensemble reinforcement learning, and residual correction. Energy 2022, 250, 123857. [Google Scholar] [CrossRef]
  49. Pearson, R.K.; Neuvo, Y.; Astola, J.; Gabbouj, M. The class of generalized hampel filters. In Proceedings of the Signal Processing Conference, Brisbane, Australia, 20 April 2015. [Google Scholar]
  50. Dragomiretskiy, K.; Zosso, D. Variational Mode Decomposition. IEEE Trans. Signal Process. 2014, 62, 531–544. [Google Scholar] [CrossRef]
  51. Li, J.; Wu, Q.; Tian, Y.; Fan, L. Monthly Henry Hub natural gas spot prices forecasting using variational mode decomposition and deep belief network. Energy 2021, 227, 120478. [Google Scholar] [CrossRef]
  52. Li, Y.; Chen, L.; Sun, C.; Liu, G.; Chen, C.; Zhang, Y. Accurate Stock Price Forecasting Based on Deep Learning and Hierarchical Frequency Decomposition. IEEE Access 2024, 12, 49878–49894. [Google Scholar] [CrossRef]
  53. Akbari, M.A.; Zare, M.; Azizipanah-abarghooee, R.; Mirjalili, S.; Deriche, M. The cheetah optimizer: A nature-inspired metaheuristic algorithm for large-scale optimization problems. Sci. Rep. 2022, 12, 10953. [Google Scholar] [CrossRef] [PubMed]
  54. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  55. Duan, Y.; Zhang, J.; Wang, X.; Feng, M.; Ma, L. Forecasting carbon price using signal processing technology and extreme gradient boosting optimized by the whale optimization algorithm. Energy Sci. Eng. 2024, 12, 810–834. [Google Scholar] [CrossRef]
  56. Hirata, E.; Matsuda, T. Forecasting Shanghai Container Freight Index: A Deep-Learning-Based Model Experiment. J. Mar. Sci. Eng. 2022, 10, 593. [Google Scholar] [CrossRef]
  57. Gao, N.; He, Y.; Ma, X. Exponential timing strategy based on EEMD-SVR predictive modeling of low frequency component. Stat. Decis. 2022, 38, 140–145. [Google Scholar]
  58. Chang, M.Z.; Park, S. Predictions and analysis of flash boiling spray characteristics of gasoline direct injection injectors based on optimized machine learning algorithm. Energy 2023, 262, 125304. [Google Scholar] [CrossRef]
Figure 1. COA-XGBoost prediction model framework.
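To make the structure sketched in Figure 1 concrete, the snippet below shows how an optimizer can wrap the XGBoost forecasting stage. It is a minimal sketch only: it uses the xgboost scikit-learn API, substitutes a plain random search for the cheetah optimization algorithm (COA), and the hyperparameter ranges, trial count, and data variables are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error


def tune_xgboost(X_train, y_train, X_val, y_val, n_trials=50, seed=0):
    """Stand-in for the COA-XGBoost step: search over key XGBoost
    hyperparameters and keep the model with the lowest validation RMSE."""
    rng = np.random.default_rng(seed)
    best_rmse, best_model = np.inf, None
    for _ in range(n_trials):
        # Candidate hyperparameters (ranges are illustrative placeholders).
        params = {
            "n_estimators": int(rng.integers(100, 800)),
            "max_depth": int(rng.integers(3, 10)),
            "learning_rate": float(rng.uniform(0.01, 0.3)),
            "subsample": float(rng.uniform(0.6, 1.0)),
            "colsample_bytree": float(rng.uniform(0.6, 1.0)),
        }
        model = xgb.XGBRegressor(objective="reg:squarederror", **params)
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        if rmse < best_rmse:
            best_rmse, best_model = rmse, model
    return best_model, best_rmse
```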
Figure 2. Structure of the proposed hybrid prediction system.
Figure 3. Prediction results of Models 1, 2, and 3.
Figure 4. Prediction results of Models 4 and 5.
Figure 5. Prediction results of Models 6 and 7.
Table 1. Summary of the relevant forecasting research work introduced in this article.

Reference | Objective(s) | Data Processing | Model | Influencing Factors
Liu [12] | Freight volume | Empirical mode decomposition (EMD) | Backpropagation neural network (BP) | Baidu search index, COVID-19 index, Historical data
Yin [13] | CCFI | Hampel filter, Complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) | Extreme learning machine (ELM) | Historical data
Han [14] | BDI | Wavelet transform (WT) | Improved SVM | Historical data
Zhang [15] | FDI, BDI, CBFI, CBCFI, CFD, CTFI | Complementary ensemble empirical mode decomposition (CEEMD) | Bidirectional long short-term memory (BiLSTM) | Historical data
Tsioumas [16] | BDI | Empirical mode decomposition (EMD) | Multivariate vector autoregressive model with exogenous variables (VARX) | Chinese steel production, Dry bulk fleet development, Dry bulk economic climate index (DBECI)
Du [17] | Container throughputs | Variational mode decomposition (VMD) | Extreme learning machine (ELM) | Historical data
Xie [18] | Container throughputs | Seasonal decomposition method (X-12-ARIMA) | Autoregressive integrated moving average (ARIMA), Seasonal autoregressive integrated moving average (SARIMA), Least squares support vector regression (LSSVR) | Historical data
Schramm [19] | Container freight rate | No | Vector autoregressive (VAR), Autoregressive integrated moving average (ARIMA), Autoregressive integrated moving average with exogenous variables (ARIMAX) | Logistics confidence index (LCI), Historical data
Tu [20] | CCFI | No | Deep neural network (DNN), CatBoost regression model, Robust regression model | China coastal bulk freight index (CCBFI), Global aluminum price, Container throughput
Bae [21] | BDI | No | Artificial neural network (ANN), Recurrent neural network (RNN), Long short-term memory (LSTM) | Brent oil price, Coal price, Iron ore export volume
Table 2. A list of acronyms.

Acronym | Full Term
H | Hampel filter
MAD | Median absolute deviation
CEEMDAN | Complete ensemble empirical mode decomposition with adaptive noise
VMD | Variational mode decomposition
IMF | Intrinsic mode function
COA | Cheetah optimization algorithm
SVR | Support vector regression
RF | Random forest
XGBoost | Extreme gradient boosting
CART | Classification and regression tree
Table 3. A list of notations.

Notation | Explanation
∗ | Convolution operation
∂_t | Partial derivative with respect to t
δ(t) | The Dirac delta function
L, L(t), L_t^q | The objective function
G_j | The sum of first-order gradients at the j-th node
H_j | The sum of second-order gradients at the j-th node
Ω | The regularization term
F | The function space of all CART decision trees
Table 4. Evaluation metrics.

Metric | Definition | Function
RMSE | Root mean squared error | $\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_i)-x_i\bigr)^{2}}$
MAE | Mean absolute error | $\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\bigl|f(x_i)-x_i\bigr|$
MAPE | Mean absolute percentage error | $\mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{f(x_i)-x_i}{x_i}\right|\times 100\%$
TIC | Theil inequality coefficient | $\mathrm{TIC}=\dfrac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_i)-x_i\bigr)^{2}}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^{2}}+\sqrt{\frac{1}{n}\sum_{i=1}^{n}f(x_i)^{2}}}$
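For readers who wish to reproduce the scores reported in Tables 6, 7, and 10, the four metrics can be computed as in the minimal NumPy sketch below, where f denotes the forecasts and x the observed index values, following the definitions in Table 4.

```python
import numpy as np


def evaluate(f, x):
    """Compute RMSE, MAE, MAPE (in %), and TIC for forecasts f against actuals x."""
    f, x = np.asarray(f, dtype=float), np.asarray(x, dtype=float)
    err = f - x
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / x)) * 100.0
    tic = rmse / (np.sqrt(np.mean(x ** 2)) + np.sqrt(np.mean(f ** 2)))
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "TIC": tic}
```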
Table 5. Characteristics of different models.

Models | Decomposition Technique | Evolutionary Algorithm | Outlier Handling
SVR (model 1) | — | — | —
RF (model 2) | — | — | —
XGBoost (model 3) | — | — | —
CEEMDAN-COA-XGBoost (model 4) | CEEMDAN | COA | —
VMD-COA-XGBoost (model 5) | VMD | COA | —
Hampel-CEEMDAN-COA-XGBoost (model 6) | CEEMDAN | COA | Hampel filter
Hampel-VMD-COA-XGBoost (model 7) | VMD | COA | Hampel filter
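The outlier-handling column in Table 5 refers to the Hampel filter listed in Table 2. A minimal sketch of a rolling Hampel filter is given below; the window length and threshold are illustrative assumptions, and flagged points are replaced with the local median, which is one common choice rather than necessarily the authors' exact configuration.

```python
import numpy as np


def hampel_filter(series, window=7, n_sigmas=3.0):
    """Rolling Hampel filter: flag points deviating from the local median by
    more than n_sigmas * 1.4826 * MAD and replace them with that median."""
    x = np.asarray(series, dtype=float).copy()
    k = 1.4826  # scale factor so MAD approximates the standard deviation under normality
    half = window // 2
    outliers = np.zeros(len(x), dtype=bool)
    for i in range(half, len(x) - half):
        window_vals = x[i - half:i + half + 1]
        med = np.median(window_vals)
        mad = np.median(np.abs(window_vals - med))
        if np.abs(x[i] - med) > n_sigmas * k * mad:
            outliers[i] = True
            x[i] = med
    return x, outliers
```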
Table 6. Prediction results of Models 1, 2, and 3.

Datasets | Models | RMSE | MAE | MAPE | TIC
SCFI | SVR | 68.8425 | 63.2738 | 0.0448 | 0.0152
SCFI | RF | 50.4306 | 40.4269 | 0.0286 | 0.0112
SCFI | XGBoost | 36.7491 | 13.8465 | 0.0077 | 0.0082
NCFI | SVR | 50.8329 | 37.6292 | 0.0352 | 0.0142
NCFI | RF | 41.0845 | 31.9748 | 0.0258 | 0.0115
NCFI | XGBoost | 31.6979 | 19.8456 | 0.0158 | 0.0089
Table 7. Prediction results of Models 4, 5, 6, and 7.

Datasets | Models | RMSE | MAE | MAPE | TIC
SCFI | CEEMDAN-COA-XGBoost | 15.1839 | 9.6145 | 0.0053 | 0.0034
SCFI | VMD-COA-XGBoost | 7.2803 | 4.1971 | 0.0021 | 0.0016
SCFI | Hampel-CEEMDAN-COA-XGBoost | 5.2771 | 3.9468 | 0.0024 | 0.0012
SCFI | Hampel-VMD-COA-XGBoost | 3.3268 | 2.1778 | 0.0009 | 0.0007
NCFI | CEEMDAN-COA-XGBoost | 11.1953 | 8.3779 | 0.0067 | 0.0031
NCFI | VMD-COA-XGBoost | 8.3489 | 5.7984 | 0.0043 | 0.0023
NCFI | Hampel-CEEMDAN-COA-XGBoost | 10.1763 | 7.4243 | 0.0061 | 0.0028
NCFI | Hampel-VMD-COA-XGBoost | 7.5175 | 5.2427 | 0.0041 | 0.0021
Table 8. Feature importance ranking.

SCFI Characteristic | SCFI Importance | NCFI Characteristic | NCFI Importance
Historical data | 0.2133 | Historical data | 0.2229
Aluminum price | 0.2046 | Crude price | 0.2211
BDI | 0.1693 | CCFI | 0.0949
CBCFI | 0.1035 | Container throughput | 0.0832
CCBFI | 0.0831 | TDI | 0.0689
CCFI | 0.0738 | CCBFI | 0.0624
BDI Baidu index | 0.0493 | BDI | 0.0528
Container throughput | 0.0303 | Shipping Baidu index | 0.0527
Coal price | 0.0252 | Coal price | 0.0475
Crude price | 0.0294 | Aluminum price | 0.0440
TDI | 0.0076 | CBCFI | 0.0395
Shipping Baidu index | 0.0105 | BDI Baidu index | 0.0100
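Rankings like those in Table 8 are the kind of output a trained gradient-boosting model exposes directly. The sketch below shows one hedged way to obtain such a ranking with the xgboost scikit-learn API; the feature names, the design matrix X, and the fixed tree count are hypothetical placeholders, not the authors' setup.

```python
import pandas as pd
import xgboost as xgb

# Hypothetical design matrix: one column per influencing factor plus lagged index values.
# X = pd.DataFrame(..., columns=["Historical data", "Aluminum price", "BDI", ...]); y = ...


def importance_ranking(X: pd.DataFrame, y) -> pd.Series:
    """Fit an XGBoost regressor and return its normalized feature importances, sorted."""
    model = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=300)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
```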
Table 9. The improvement percentage of the models.

Datasets | Benchmark Model vs. Comparative Model | P_RMSE | P_MAE | P_MAPE | P_TIC
SCFI | SVR vs. XGBoost | 46.62% | 78.12% | 82.81% | 46.05%
SCFI | RF vs. XGBoost | 27.13% | 65.75% | 73.08% | 26.79%
SCFI | XGBoost vs. CEEMDAN-COA-XGBoost | 58.68% | 30.56% | 31.69% | 58.54%
SCFI | XGBoost vs. VMD-COA-XGBoost | 80.19% | 69.69% | 72.73% | 80.49%
SCFI | CEEMDAN-COA-XGBoost vs. VMD-COA-XGBoost | 52.05% | 56.34% | 60.38% | 52.94%
SCFI | CEEMDAN-COA-XGBoost vs. Hampel-CEEMDAN-COA-XGBoost | 65.25% | 58.95% | 54.72% | 64.71%
SCFI | VMD-COA-XGBoost vs. Hampel-VMD-COA-XGBoost | 54.30% | 48.11% | 57.14% | 56.25%
SCFI | Hampel-CEEMDAN-COA-XGBoost vs. Hampel-VMD-COA-XGBoost | 36.96% | 44.82% | 62.50% | 41.67%
NCFI | SVR vs. XGBoost | 37.64% | 47.26% | 55.11% | 37.32%
NCFI | RF vs. XGBoost | 22.85% | 37.93% | 38.76% | 22.61%
NCFI | XGBoost vs. CEEMDAN-COA-XGBoost | 64.68% | 57.78% | 57.59% | 65.17%
NCFI | XGBoost vs. VMD-COA-XGBoost | 73.66% | 70.78% | 72.78% | 74.16%
NCFI | CEEMDAN-COA-XGBoost vs. VMD-COA-XGBoost | 25.42% | 30.79% | 35.82% | 25.81%
NCFI | CEEMDAN-COA-XGBoost vs. Hampel-CEEMDAN-COA-XGBoost | 9.10% | 11.38% | 8.96% | 9.68%
NCFI | VMD-COA-XGBoost vs. Hampel-VMD-COA-XGBoost | 9.96% | 9.58% | 4.65% | 8.70%
NCFI | Hampel-CEEMDAN-COA-XGBoost vs. Hampel-VMD-COA-XGBoost | 26.13% | 29.38% | 37.79% | 25.00%
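The entries in Table 9 are consistent with the standard relative-improvement definition shown below; since the defining equation is not restated in this excerpt, this is offered as a reader-side consistency check rather than a quotation. Substituting the SCFI RMSE values of SVR and XGBoost from Table 6 reproduces the 46.62% figure:

$$
P_{\mathrm{RMSE}}=\frac{\mathrm{RMSE}_{\text{benchmark}}-\mathrm{RMSE}_{\text{comparative}}}{\mathrm{RMSE}_{\text{benchmark}}}\times 100\%,
\qquad
\frac{68.8425-36.7491}{68.8425}\times 100\%\approx 46.62\%,
$$

with $P_{\mathrm{MAE}}$, $P_{\mathrm{MAPE}}$, and $P_{\mathrm{TIC}}$ defined analogously.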
Table 10. Comparison of prediction results with other models.

Dataset | Model | RMSE
SCFI | LSTM | 17.62
SCFI | Proposed model | 3.09