1. Introduction
As the backbone of global trade, maritime transport moves nearly 80% of the world’s cargo due to its extensive network coverage and cost efficiency [
1]. Within this system, tramp shipping (also known as irregular shipping), as opposed to liner shipping with fixed routes and schedules, operates on a non-scheduled, demand-driven chartering basis [
2]. It plays an irreplaceable role in the transport of bulk commodities such as iron ore, coal, and grains. This operational flexibility makes tramp vessels the primary carriers of global raw materials and resources. However, the absence of fixed sailing schedules also makes predicting vessel Estimated Time of Arrival (ETA) significantly more challenging than in the highly structured liner shipping sector.
It is particularly important in tramp shipping to accurately predict ETA because of its direct implications for cargo readiness, port resource planning, and charter party performance. Even minor inaccuracies in arrival time can delay cargo operations, disrupt berthing assignments, increase storage and demurrage costs, and weaken supply chain coordination. Contractual risks further heighten the importance of ETA accuracy. Under the Laydays and Canceling Date (LAYCAN) clause defined in Article 97 of China’s Maritime Code, a vessel must arrive at the designated port before the agreed canceling date; otherwise, the charterer may cancel the contract. Therefore, ETA deviations in tramp shipping are not only operationally disruptive but may also directly jeopardize charter party fulfillment, highlighting the centrality of arrival-time controllability in maritime commercial practice.
Furthermore, uncertainty in ETA can also trigger cascading disruptions throughout port logistics systems, leading to substantial economic losses. Studies indicate that even in the well-structured container liner sector, discrepancies between Actual Time of Arrival (ATA) and scheduled ETA can range from 30 to 40 h [
3]. In the more unpredictable tramp bulk segment, deviations are often larger and produce more severe consequences. Port operations—such as berth allocation, equipment deployment, manpower scheduling, and yard planning—are typically the first to be impacted [
4,
5]. Large ETA deviations result in idle pre-allocated resources, unexpected vessel clustering, reduced berth turnover, and ultimately port congestion.
At the strategic level, the Development Research Center of the State Council (DRC) observes that global supply chains are undergoing a shift towards a more regionalized and multipolar structure, posing new challenges for maritime logistics [
6]. Liner shipping, with its fixed routes and schedules, offers economies of scale but lacks the flexibility to adjust promptly to evolving regional trade patterns. In contrast, tramp shipping provides strong adaptability due to its flexible voyage planning, but its inherent schedule irregularity leads to weak ETA predictability and reduced port coordination efficiency. Consequently, reliable ETA prediction remains a persistent challenge for the maritime sector, despite International Maritime Organization (IMO) requirements that vessels report ETA in advance [
7].
In this context, digital transformation offers a promising path forward. Leveraging the Internet of Things and artificial intelligence technologies enables the development of dynamic, data-driven ETA prediction models that enhance the controllability of tramp vessel operations. This study focuses on ETA prediction for bulk carriers operating on a long-distance route from Port Weipa, Australia, to Qingdao, China. To address the challenges posed by long voyage durations and high operational uncertainty, a stacking ensemble machine-learning framework is proposed. Using historical Automatic Identification System (AIS) trajectory data combined with multiple voyage-related features, the model conducts multi-feature fusion to generate high-precision ETA predictions. The main innovative contributions of this study are as follows:
- 1.
Development of a stacking ensemble ETA prediction framework for long-haul bulk carriers.
The study introduces a robust ETA prediction model that integrates Light Gradient-Boosting Machine (LightGBM), Extreme Gradient Boosting (XGBoost), and Random Forest (RF) through a Linear Regression meta-learner. This framework is specifically tailored to the operational characteristics of long-haul bulk carriers, overcoming the limitations of single-model approaches commonly used in prior research.
- 2.
Construction of a comprehensive multi-dimensional AIS feature system.
A systematic feature engineering pipeline is developed, incorporating baseline, temporal, speed-related, course-related, static, and historical behavior features. This six-category, 72-feature set captures both instantaneous navigation states and longer-term behavioral patterns, enabling richer ETA inference than existing studies.
- 3.
Extensive performance evaluation and interpretability analysis.
Through multi-metric benchmarking, spatial prediction assessment, residual analysis, and feature importance interpretation, the study demonstrates that the stacking model achieves superior accuracy (Mean Absolute Error (MAE) of 3.30 h), stability, and robustness. The meta-learner’s weighting strategy is analyzed, revealing how complementary error structures among base models enhance ensemble performance.
- 4.
Operational insights for maritime logistics and port management.
By reducing ETA uncertainty by approximately 75% compared with conventional benchmarks, the proposed model provides substantial practical value for berth scheduling, resource allocation, and supply chain coordination in long-distance bulk shipping.
The remainder of this paper is structured as follows.
Section 2 reviews relevant literature.
Section 3 describes the methodology, including data preprocessing, feature engineering, model construction, and the evaluation criteria.
Section 4 presents the experimental results.
Section 5 discusses the limitations of the study and proposes directions for future research. Finally,
Section 6 concludes by summarizing the key findings and implications.
2. Literature Review
The core focus of this paper concerns the application of stacking models, ETA prediction methods, the practical challenges of ETA in port planning, and the factors influencing ETA accuracy. Accordingly, the literature is reviewed from these four perspectives to identify existing research gaps.
2.1. Application of Stacking Models
Stacking represents an advanced ensemble learning technique in which predictions from multiple heterogeneous base learners are combined through a meta-learner. This method has demonstrated superior performance in areas involving high-dimensional, nonlinear, and multi-patterned data.
In environmental science, stacking is used to synthesize multi-source geospatial information for risk assessments. Shojaeian et al. developed a hybrid stacking–Principal Component Analysis (PCA) system incorporating six models to achieve highly accurate flood susceptibility mapping [
8]. In civil engineering, stacking has shown strong predictive capability in structural analysis and signal processing tasks [
9]. Shu et al. introduced a stacking framework optimized via Bayesian Optimization (BO) that significantly improved the prediction accuracy of reinforced concrete shear capacity [
10]. In healthcare, Nguyen and Byeon demonstrated the utility of stacking combined with Local Interpretable Model-agnostic Explanations (LIME) for diagnosing depressive symptoms in Parkinson’s patients, addressing challenges arising from overlapping clinical features [
11]. In the energy sector, Cao et al. formulated an LSTM–Informer hybrid backed by stacking algorithms for multi-timescale photovoltaic power forecasting, resulting in substantial improvements in prediction reliability [
12].
Collectively, these studies highlight stacking’s ability to integrate complementary strengths across models, reduce overfitting, and enhance robustness—properties particularly valuable for maritime ETA prediction, where data are noisy, highly nonlinear, and influenced by numerous external factors.
2.2. ETA Prediction Methods in the Shipping Industry
Research on vessels’ ETA prediction generally falls into two methodological categories: trajectory-based models and feature-based regression models.
Trajectory-based approaches estimate ETA by first predicting a vessel’s future sailing path and then converting that trajectory into travel time [
13]. For example, Alessandrini et al. applied Dijkstra’s algorithm to derive optimal navigation routes [
14], while Wu et al. developed a multi-scale visibility graph method suitable for autonomous long-distance navigation [
15]. Subsequent studies incorporate more complex navigational and environmental factors. For instance, Park et al. embedded AIS data into a Reinforcement Learning (RL) framework and used Bayesian sampling to estimate speed over ground (SOG), a key input for converting geometric routes into time estimates [
16]. Ogura et al. proposed a two-stage scheme explicitly accounting for weather impacts on routing and speed [
17], and Li et al. introduced deep RL with artificial potential fields to enhance dynamic routing performance [
18]. While effective for structured navigation scenarios with predictable sailing corridors, these methods depend heavily on accurately forecasting the entire future trajectory. This requirement compounds prediction errors and renders such approaches unsuitable for tramp shipping, where routing flexibility and operational variability undermine trajectory predictability.
Feature-based regression methods, by contrast, treat ETA prediction as a supervised learning task that maps extracted features to arrival times [
19]. Early efforts using logistic regression, Classification and Regression Trees (CART), and RF relied primarily on static attributes or coarse environmental variables [
20,
21]. However, these static models perform poorly in dynamic maritime environments, motivating the shift toward machine learning algorithms capable of exploiting the continuously updated position, speed, and heading information provided by AIS. Research has since evolved toward dynamic AIS-driven modeling, employing algorithms ranging from Support Vector Machines (SVM) and Neural Networks (NN) to deep learning architectures [
22]. Noman et al. showed that Gated Recurrent Units (GRU)-based recurrent models effectively capture temporal dependencies and outperform Gradient-Boosting Decision Trees (GBDT) and Multi-layer Perceptron (MLP) for inland ETA prediction [
23]. Bourzak et al. further compared MLP, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNNs), and Transformer architectures, identifying Bidirectional Long Short-Term Memory (BiLSTM) as the most effective for sequence-based ETA estimation [
24].
Despite these advances, existing methods remain limited in their applicability to long-distance tramp bulk carriers, whose voyages are far longer, less structured, and influenced by more volatile operational conditions than those of liner vessels. This highlights a need for modeling frameworks that are robust to nonlinearity, high variability, and sparse external information.
2.3. ETA Situation in Port Operations
Within the operational framework of ports, the ETA of incoming vessels serves as a fundamental input for critical decision-making processes, including berth allocation, resource scheduling, and labor planning. The prevailing practice is that ports depend almost exclusively on ETA information provided by the incoming vessels themselves. This information is typically communicated through established channels such as ship agents, direct emails, or standardized messaging within AIS transmissions [
20,
25]. The content of these ETAs is often derived from manual estimates made by the ship’s master, based on experience and prevailing conditions at the time of reporting [
13].
This operational model, however, is characterized by inherent uncertainty. The dynamic and complex nature of maritime voyages means that the initially reported ETA is frequently subject to revision. Factors such as adverse weather, delays at preceding ports, or changes in sailing speed necessitate continuous updates, making the ETA a variable rather than a fixed parameter during a vessel’s journey [
13,
20]. Consequently, this variability directly impacts port planning efficiency.
The reliance on these volatile, externally provided ETAs presents a well-acknowledged operational scenario for port authorities. The discrepancy between planned and actual arrival times can disrupt meticulously crafted schedules. This often manifests in two contrasting inefficiencies: under-utilization of resources (e.g., quay cranes, pilots, and stevedores standing idle due to delays) or over-congestion at the terminal when multiple vessels arrive earlier than anticipated [
26,
27]. Such inaccuracies can extend beyond terminal operations, ultimately hindering port competitiveness and the efficiency of interconnected multimodal transport chains [
28]. Thus, the management of ETA uncertainty is a recognized and persistent aspect of daily port logistics.
This prevailing situation underscores a clear gap between existing operational practices and the level of information reliability required for optimized, resilient port planning. It establishes a direct practical imperative for the development of more accurate, stable, and data-driven ETA prediction methodologies [
13,
27].
2.4. Factors Affecting Vessel ETA
There is a broad consensus in the literature that ETA is shaped by a complex interplay of vessel characteristics, operational decisions, environmental conditions, and route-specific factors [
29]. Based on an extensive synthesis of prior research, these determinants can be grouped into the major categories summarized in
Table 1. Within the Macro route structure category, a precise operational definition of the voyage endpoint (e.g., port boundaries or anchorage areas) is crucial. Ambiguity in this geographical delimiter directly affects the calculation of the final leg’s duration, impacting ETA accuracy. Therefore, standardizing these definitions is an important prerequisite for reliable prediction [
30].
For ocean-going tramp bulk carriers, these factors are amplified by several intrinsic operational characteristics—such as the absence of fixed routes, schedules, or predetermined itineraries; frequent market-driven changes in loading/discharging ports; diverse navigational environments across long-distance routes; and limited transparency regarding commercial decisions. Such variability results in highly nonlinear and unpredictable ETA behavior, posing significant challenges for data-driven prediction models that rely on stable patterns or consistent route characteristics [
27,
44].
2.5. Research Gap
Although substantial progress has been made in ETA prediction research, existing studies largely rely on data from liner shipping, where fixed routes, stable schedules, and recurring voyage patterns provide a supportive environment for trajectory-based and feature-based modeling. These assumptions break down in tramp shipping, where routing is market-driven rather than predetermined, schedules are irregular or nonexistent, and operational decisions change dynamically during the voyage. As a result, existing models tend to exhibit poor generalization performance when applied to long-distance, variable tramp operations.
Additionally, many recent studies enhance ETA models by incorporating external datasets, such as high-resolution weather fields, port congestion indicators, or commercial operation logs. While such information improves accuracy, it is often unavailable or unreliable for tramp vessels, whose commercial activities are not systematically recorded or disclosed. In contrast, AIS trajectories inherently encode the vessel’s navigational behavior—an integrated reflection of engine settings, environmental forces, operational decisions, and traffic interactions. However, how to systematically mine and fuse these latent signals remains underexplored.
Lastly, although stacking models have demonstrated exceptional performance in diverse engineering fields, their application to long-distance ETA prediction for tramp bulk carriers has not been fully investigated. The potential of stacking to exploit complementary residual structures, enhance robustness, and reduce prediction uncertainty represents an important opportunity.
To address these gaps, this study proposes a tailored stacking ensemble framework designed for the complex, nonlinear, and high-uncertainty operational environment of tramp bulk carriers, aiming to significantly improve the accuracy and reliability of long-distance ETA prediction.
3. Methodology
To address the operational complexity and variability inherent in long-distance tramp bulk carrier voyages, this study develops a stacked ensemble learning framework for ETA prediction using AIS data. Through systematic preprocessing of historical AIS records and targeted feature engineering, the proposed model learns the nonlinear relationships between a vessel’s navigational state and its remaining sailing time, enabling accurate end-to-end ETA forecasting.
As illustrated in
Figure 1, the methodological workflow comprises four core phases:
The data preprocessing phase begins with cleaning raw AIS messages to remove errors, outliers, and duplicate records. Additional filtering ensures temporal continuity and operational relevance, after which vessel trajectories are segmented into individual voyages. This process establishes a high-quality dataset suitable for subsequent modeling.
Subsequently, the feature engineering phase constructs a comprehensive set of features that capture voyage efficiency, vessel motion dynamics, and spatiotemporal context. These include speed- and course-related descriptors, distance-based indicators, and rolling-window statistics designed to characterize short- and mid-term behavioral patterns. Collectively, these engineered attributes provide the model with rich information regarding sailing behavior and its link to ETA.
Following this, the model construction and training phase establishes a stacked ensemble framework, comprising LightGBM, XGBoost, and RF as base learners. Their outputs are integrated by a Linear Regression meta-learner, which synthesizes complementary predictive patterns and enhances model robustness and generalization. Hyperparameter tuning of all learners is performed to achieve optimal predictive performance.
Finally, performance evaluation is carried out using four complementary metrics: MAE, Root Mean Square Error (RMSE), Symmetric Mean Absolute Percentage Error (sMAPE), and the coefficient of determination (R²). Together, these indicators evaluate both absolute error magnitude and the model’s explanatory power across diverse operational conditions.
3.1. Data Preprocessing
AIS data often contains noise, discontinuities, missing values, and redundant records resulting from equipment malfunction, signal interference, and transmission delays. To ensure data quality and improve the reliability of subsequent ETA prediction, this study implements a systematic preprocessing pipeline to transform raw AIS messages into a high-quality vessel trajectory dataset. As depicted in
Figure 2, the preprocessing consists of three main components: data cleaning, trajectory completion, and trajectory compression.
3.1.1. Data Cleaning
A four-step cleaning procedure is applied to remove erroneous and implausible records while preserving genuine vessel movements.
The procedure commences with time standardization, wherein all timestamps are converted to the standard format “YYYY-MM-DD hh:mm:ss”. Records with missing or invalid timestamps are removed to ensure temporal consistency. After this step, the number of retained AIS records reported in
Figure 2 is 835,203. Subsequently, duplicate removal is executed by applying two rules: (a) for identical Maritime Mobile Service Identity (MMSI)–timestamp pairs, only the earliest message is retained; (b) records where SOG exceeds 1 knot while reported coordinates remain unchanged from the previous message are identified as anomalous duplicates and removed. After removing duplicates, the record count decreases further, as reported in
Figure 2.
The cleaning process continues with physical threshold filtering, discarding data points that fall outside the valid physical ranges defined by maritime operation standards (
Table 2), including invalid positions, unrealistic speeds, and out-of-range vessel dimensions. After filtering, the record count decreases again, as reported in
Figure 2.
The final step addresses the removal of trajectory jump points caused by AIS noise. A composite detection rule is applied: (a) instantaneous speeds exceeding 16 knots are removed; (b) heading changes between consecutive points greater than 180° are filtered out; (c) large spatial deviations inconsistent with vessel dynamics are eliminated. After removing jump points, the final retained dataset decreases to 822,650 high-quality records.
Overall, 98.5% of the original dataset is preserved, indicating that the cleaning rules effectively removed invalid data while maintaining the integrity of vessel trajectories.
Figure 3 compares the AIS trajectories before and after cleaning. A zoomed-in view of a representative region (125–135° E, 5° S–10° N, marked by blue rectangles) is included to demonstrate specific examples of cleaning, where overlapping and erroneous points are most prevalent. This comparison clearly shows that the cleaning procedure effectively preserves genuine vessel movements while eliminating erroneous data points, thereby ensuring a reliable foundation for subsequent analysis.
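To make the four-step cleaning procedure concrete, the following sketch implements the rules in pandas. The column names (mmsi, ts, lat, lon, sog) are illustrative assumptions, and the jump-point screen is simplified to use the reported SOG rather than the instantaneous speed computed between consecutive fixes.

```python
import pandas as pd

def clean_ais(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the four-step AIS cleaning procedure (illustrative columns)."""
    # 1. Time standardization: parse timestamps, drop unparseable records.
    df = df.copy()
    df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
    df = df.dropna(subset=["ts"]).sort_values(["mmsi", "ts"])

    # 2a. Duplicate removal: keep the earliest message per (MMSI, timestamp).
    df = df.drop_duplicates(subset=["mmsi", "ts"], keep="first")

    # 2b. Anomalous duplicates: SOG > 1 kn while position is unchanged.
    same_pos = (df.groupby("mmsi")["lat"].diff() == 0) & \
               (df.groupby("mmsi")["lon"].diff() == 0)
    df = df[~(same_pos & (df["sog"] > 1))]

    # 3. Physical threshold filtering (valid position and speed ranges).
    df = df[df["lat"].between(-90, 90) & df["lon"].between(-180, 180)
            & df["sog"].between(0, 30)]

    # 4. Jump-point screening; the paper's 16-knot limit is applied here to
    #    the reported SOG as a simplification.
    return df[df["sog"] <= 16].reset_index(drop=True)
```

Each step only removes records, so the retained counts after each stage can be logged to reproduce a funnel like the one reported in Figure 2.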
3.1.2. Data Completion
AIS signals may contain irregular reporting intervals or missing values, especially in open-ocean segments. To restore temporal continuity and preserve navigational semantics, this study adopts a stratified completion strategy, combining forward–backward filling for static fields with adaptive interpolation for kinematic variables. Static or categorical fields (e.g., MMSI, IMO, vessel dimensions, navigation status) are filled using forward–backward propagation, ensuring consistency without altering vessel identity or voyage descriptors. Dynamic numerical fields (e.g., latitude, longitude, SOG, heading) are reconstructed using adaptive linear interpolation when the interval between adjacent AIS messages exceeds a predefined threshold. For two valid AIS points (t1, x1) and (t2, x2), the interpolated value x(t) at any intermediate time t (t1 < t < t2) is calculated by Equation (1):
x(t) = x1 + (x2 − x1) · (t − t1) / (t2 − t1)
(1)
Interpolation significantly improves temporal completeness. For example, in the representative trajectory (MMSI: 538005339 [
49]), the number of records increases from 7399 to 9357 after interpolation (
Figure 4), reconstructing otherwise missing navigational segments. This yields uniformly spaced, continuous trajectories suitable for extracting rolling-window features (e.g., average SOG or acceleration), critical components for accurate ETA prediction.
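The threshold-gated interpolation of Equation (1) can be sketched in pandas as follows. The column names, the 60 min gap threshold, and the 10 min resampling step are illustrative assumptions, not the paper's calibrated values.

```python
import pandas as pd

def interpolate_gaps(traj: pd.DataFrame, max_gap_min: int = 60,
                     step_min: int = 10) -> pd.DataFrame:
    """Insert linearly interpolated fixes wherever the gap between two
    consecutive AIS messages exceeds `max_gap_min` minutes."""
    traj = traj.sort_values("ts").set_index("ts")
    pieces = [traj]
    for t1, t2 in zip(traj.index[:-1], traj.index[1:]):
        if (t2 - t1) > pd.Timedelta(minutes=max_gap_min):
            # Empty rows at new timestamps strictly between the two fixes.
            new_ts = pd.date_range(t1, t2, freq=f"{step_min}min")[1:-1]
            pieces.append(pd.DataFrame(index=new_ts, columns=traj.columns))
    out = pd.concat(pieces).sort_index()
    # x(t) = x1 + (x2 - x1) * (t - t1) / (t2 - t1), applied column-wise.
    out = out.astype(float).interpolate(method="time")
    return out.reset_index(names="ts")
```

Time-weighted interpolation (method="time") reproduces Equation (1) exactly for unevenly spaced timestamps, which plain index-based interpolation would not.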
3.1.3. Trajectory Compression
To mitigate data redundancy and minimize the risk of model overfitting, this study adopts a dual-criterion trajectory compression method that integrates geometric feature preservation with temporal coverage guarantee [
51]. In the spatial dimension, the Douglas–Peucker (DP) algorithm is applied to retain key geometric points whose perpendicular deviation from the connecting chord exceeds a predefined tolerance. This ensures that essential navigational behaviors, such as course alterations and maneuvering segments, are preserved. The perpendicular distance d is computed using Equation (2):
d = |(B − A) × (P − A)| / |B − A|
(2)
where P is the evaluated point, A and B denote the endpoints of the chord, and × denotes the two-dimensional cross product.
In the temporal dimension, a fixed-interval sampling scheme is introduced to maintain temporal consistency by uniformly selecting points throughout the voyage. The final compressed trajectory is obtained by merging the spatially and temporally selected points, thereby preserving both geometric structure and time-series integrity.
Following parameter calibration, the DP tolerance is set to a distance of approximately 1.1 km, and the temporal sampling interval to a fixed number of minutes. A non-recursive implementation of the DP algorithm improves computational efficiency for large-scale datasets. All static vessel attributes and dynamic motion parameters are maintained to avoid information loss.
Across all voyages, only 8.23% of the original AIS points are retained, resulting in a compression rate of 91.77%. This substantial compression rate primarily reflects the removal of redundant data commonly found in long-distance voyages, rather than the exclusion of features vital for ETA prediction. The dual-criterion method is specifically designed to preserve the geometric and temporal anchors that define a voyage profile. As shown in
Figure 5, while redundant points in straight-line segments are removed, all turning and maneuvering points—critical for understanding vessel behavior—are retained. This approach ensures that the compression process improves data quality for model training by increasing information density, while maintaining the integrity of the key features necessary for ETA prediction. The resulting compact representation provides a high-quality, informative input for downstream ETA modeling.
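A minimal, dependency-free sketch of the non-recursive DP pass is given below. It evaluates Equation (2) with a planar approximation directly on coordinate pairs; the fixed-interval temporal sampling and the merging of the two point sets are omitted for brevity.

```python
import math

def perp_dist(p, a, b):
    """Equation (2): perpendicular distance from point P to chord AB
    (planar approximation on (x, y) coordinate pairs)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:          # degenerate chord: A == B
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - (ax - px) * dy) / math.hypot(dx, dy)

def douglas_peucker(points, tol):
    """Non-recursive (stack-based) Douglas-Peucker simplification."""
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        i, j = stack.pop()
        # Find the point deviating most from the chord between i and j.
        d_max, idx = 0.0, -1
        for k in range(i + 1, j):
            d = perp_dist(points[k], points[i], points[j])
            if d > d_max:
                d_max, idx = d, k
        if d_max > tol:              # keep the farthest point, split the chord
            keep[idx] = True
            stack += [(i, idx), (idx, j)]
    return [p for p, k in zip(points, keep) if k]
```

Points on straight segments fall below the tolerance and are dropped, while turning points exceed it and are retained, which is exactly the behavior illustrated in Figure 5.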
3.2. Feature Engineering Extraction
Feature engineering plays a pivotal role in transforming raw AIS data into informative variables that capture underlying navigation patterns and support accurate ETA prediction [
52,
53]. In this study, a comprehensive feature set is systematically designed to reflect the operational characteristics of long-distance tramp bulk carriers. The constructed features are organized into six categories, as summarized in
Appendix A,
Table A1.
Specifically, Baseline features offer the fundamental physical information directly related to transit time estimation. Temporal features encode periodic patterns, such as seasonal patterns, weekday–weekend differences, and holiday effects, that may influence vessel behavior and port operational efficiency. Course-related features quantify directional alignment and course stability, enabling the model to detect route deviations arising from traffic avoidance, meteorological disturbances, or other navigational decisions.
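As an example of a baseline feature, the remaining great-circle distance to the destination can be derived from each AIS fix with the haversine formula. This is an illustrative sketch: the paper does not specify its exact distance formulation, and the destination coordinates below are approximate.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles

def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two positions
    given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))

# Baseline feature: distance remaining from the current fix to the
# destination (approximate coordinates for Weipa and Qingdao).
dist_to_go = haversine_nm(-12.67, 141.87, 36.07, 120.32)
```

The great-circle value is a lower bound on the actual sailing distance (about 3500 nm on this route), since the vessel must navigate around the Indonesian Archipelago.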
Speed-related features constitute the largest group and characterize both short-term and long-term vessel motion dynamics. These include instantaneous acceleration and rolling statistics of SOG computed over 6, 12, 24, and 48 h windows, allowing the model to distinguish stable open-sea cruising from low-speed sailing in congested or restricted waters. Static features incorporate vessel geometry, capturing inherent differences in maneuverability and ensuring that predictions remain consistent with physical constraints. Finally, historical behavioral features summarize long-term navigational tendencies unique to each MMSI, enabling the model to exploit vessel-specific patterns learned from past voyages.
Through this structured feature engineering process, a total of 72 features are generated across six dimensions. Collectively, these features capture instantaneous navigation states, temporal context, route-following behavior, dynamic motion patterns, vessel-specific attributes, and historical sailing habits. This multi-layered representation provides a robust and information-rich input foundation for the stacked ensemble ETA prediction model.
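The speed-related rolling-window statistics can be sketched in pandas as follows. The window lengths (6, 12, 24, 48 h) follow the paper, while the column names and the particular statistics chosen are illustrative.

```python
import pandas as pd

def speed_features(traj: pd.DataFrame) -> pd.DataFrame:
    """Sketch of speed-related features on a time-indexed trajectory
    with a 'sog' column (illustrative naming)."""
    out = traj.copy()
    # Instantaneous acceleration (knots per hour) from consecutive fixes.
    dt_h = out.index.to_series().diff().dt.total_seconds() / 3600.0
    out["accel"] = out["sog"].diff() / dt_h
    # Rolling statistics of SOG over each time window.
    for h in (6, 12, 24, 48):
        win = out["sog"].rolling(f"{h}h")
        out[f"sog_mean_{h}h"] = win.mean()
        out[f"sog_std_{h}h"] = win.std()
        out[f"sog_min_{h}h"] = win.min()
    return out
```

Time-based windows (rather than fixed row counts) keep the features well defined even when AIS reporting intervals vary, which is why the interpolation step in Section 3.1.2 matters for their quality.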
3.3. Model Construction
3.3.1. Stacking Model
To address the performance limitations and instability often observed in single-model ETA prediction approaches, this study constructs a two-stage stacking ensemble framework tailored to the characteristics of long-distance tramp bulk carriers. As illustrated in
Figure 6, the first layer integrates three heterogeneous base learners (LightGBM, XGBoost, and RF) to capture diverse learning patterns and error structures. The second layer employs a Linear Regression meta-learner, which fuses the base-model outputs to generate the final ETA prediction.
The proposed stacking ensemble framework is implemented using Python 3.9. The implementation employs widely used machine learning libraries: scikit-learn (version 1.2.2) for constructing the modeling pipeline and for the Linear Regression meta-learner; XGBoost (version 1.7.6) and LightGBM (version 4.1.0) as the gradient-boosting base learners; and the Random Forest Regressor from scikit-learn. The development environment utilized is PyCharm 2025.1, and all experiments are conducted on a workstation equipped with a 12th Gen Intel(R) Core (TM) i5-12500H (3.10 GHz) processor and 16 GB of RAM.
The workflow is organized into two phases: a training phase and a prediction phase. During training, BO, combined with 3-fold cross-validation and MAE as the evaluation criterion, is applied to tune the hyperparameters of the base learners (
Figure 7). To prevent data leakage, a 5-fold cross-validation scheme is then used to generate out-of-fold predictions, which constitute the meta-train matrix Z. Paired with the true labels y, the matrix Z is used to train the Linear Regression meta-learner.
In the prediction phase, each trained base model produces a prediction for the test set. The final base-model outputs are obtained by averaging predictions across folds, forming the meta-feature matrix Z_test, which is subsequently passed to the trained meta-learner to yield the ensemble’s ETA estimates.
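The two-phase workflow can be sketched with scikit-learn as below. GradientBoostingRegressor stands in for LightGBM and XGBoost so the example is self-contained, synthetic data replaces the AIS feature matrix, and, as a simplification, the base models are refit on the full training set for the prediction phase instead of averaging fold-wise models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Heterogeneous base layer (stand-ins for LightGBM, XGBoost, and RF).
bases = [
    GradientBoostingRegressor(random_state=0),
    GradientBoostingRegressor(max_depth=5, random_state=1),
    RandomForestRegressor(n_estimators=100, random_state=2),
]

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Training phase: 5-fold out-of-fold predictions form the meta-train matrix Z,
# so the meta-learner never sees a base model's in-sample fit (no leakage).
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in bases])
meta = LinearRegression().fit(Z_tr, y_tr)

# Prediction phase: stack the base-model test predictions and let the
# meta-learner fuse them into the final ETA estimate.
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in bases])
y_hat = meta.predict(Z_te)
```

The fitted meta coefficients (meta.coef_) quantify each base model's contribution, which is the interpretability property analyzed in Section 4.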
3.3.2. Base Model and Meta-Model Selection
After extensive experimentation and comparative evaluation, this study selects LightGBM, XGBoost, and RF as the base learners due to their complementary strengths and proven robustness in large-scale, nonlinear, and high-dimensional learning tasks—characteristics aligned with AIS-based ETA prediction.
LightGBM achieves high efficiency and strong baseline accuracy through histogram-based computation and leaf-wise tree growth [
54]. XGBoost augments this with second-order gradient optimization and explicit regularization, enabling more precise loss minimization [
55]. RF, based on bootstrap aggregation, introduces model diversity and enhances robustness by reducing variance, effectively counterbalancing the overfitting tendencies of boosting algorithms [
56,
57].
Collectively, these models represent complementary learning biases (efficiency, precision, and robustness), forming a diversified base layer capable of capturing complex data patterns.
A Linear Regression model is selected as the meta-learner to integrate the base learners’ predictions [
36]. Its simplicity reduces the risk of overfitting given the low dimensionality of meta-features, while its convex optimization yields interpretable coefficients that clarify the relative contribution of each base model.
Section 4 further analyzes these coefficients and their implications.
3.3.3. Model Training and Optimization Strategy
To ensure robust predictive performance, this study incorporates a combined optimization strategy involving BO and cross-validation. BO serves as the primary hyperparameter tuning method, efficiently navigating the parameter space by updating search direction from prior evaluations. A limited evaluation budget of 50–100 iterations is configured for BO. This range is determined based on the dimensionality of our hyperparameter search space (comprising 7 key hyperparameters), ensuring a thorough exploration while maintaining computational tractability. This budget strikes a balance between efficiency and performance, addressing computational scalability for future expansion. The primary computational cost scales with the number of evaluations (each requiring model training), not with the complexity of the BO meta-algorithm itself. While per-evaluation training time will increase with larger multi-route datasets, the BO overhead remains negligible. Moreover, evaluations are parallelizable, and with a fixed iteration count, the overhead remains manageable in large-scale applications. With 3-fold cross-validation and MAE minimization as the objective, the algorithm optimizes key hyperparameters such as learning rate, tree depth, and the number of estimators for LightGBM and XGBoost, as well as tree quantity and depth for RF.
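The tuning loop described above can be sketched as follows. Scikit-learn's RandomizedSearchCV is used here as a stand-in for the Bayesian optimizer (a BO implementation such as scikit-optimize's BayesSearchCV exposes the same estimator interface), and the search ranges, iteration count, and base learner are illustrative.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space covering the hyperparameters named in the text:
# learning rate, tree depth, and number of estimators.
space = {
    "learning_rate": uniform(0.01, 0.29),   # U(0.01, 0.30)
    "max_depth": randint(3, 10),
    "n_estimators": randint(50, 200),
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    space,
    n_iter=10,                           # paper budget: 50-100 evaluations
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_absolute_error",   # MAE minimization as the objective
    random_state=0,
)

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
search.fit(X, y)
```

Each of the n_iter candidates costs one 3-fold model training, which is why the total cost scales with the evaluation budget rather than with the optimizer itself.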
A rigorous 5-fold cross-validation scheme is strictly employed during meta-feature generation to avoid data leakage. All meta-features provided to the meta-learner are out-of-sample predictions generated from an unseen fold, ensuring that the meta-learner is trained on unbiased and representative information.
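The leak-free meta-feature scheme can be illustrated with scikit-learn's `cross_val_predict`: every meta-feature value is an out-of-fold prediction, so the meta-learner never sees a prediction made on data its base model trained on. The base learners below (with `GradientBoostingRegressor` standing in for LightGBM/XGBoost) and the toy data are assumptions for the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=250)

base_learners = [
    GradientBoostingRegressor(random_state=0),  # stand-in for LightGBM/XGBoost
    RandomForestRegressor(n_estimators=100, random_state=0),
]

# 5-fold out-of-sample predictions: one meta-feature column per base learner.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for model in base_learners
])

# Linear Regression meta-learner fitted on unbiased out-of-fold predictions.
meta_learner = LinearRegression().fit(meta_features, y)
```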
Together, the BO-driven hyperparameter tuning and the multi-level cross-validation framework provide a robust foundation for constructing a high-performing and reliable stacking ensemble model for long-distance ETA prediction in tramp shipping.
3.4. Evaluation Criteria
To comprehensively evaluate the performance of the ETA prediction model, four complementary evaluation metrics are employed: MAE, RMSE, sMAPE, and R². Their mathematical formulations and interpretive meanings are summarized in
Table 3.
Collectively, these metrics provide a multi-dimensional assessment of model quality, capturing absolute and squared-error behavior, scale-independent proportional errors, and overall explanatory strength. All experimental results reported in
Section 4 are evaluated within this unified framework, which was predefined here, to ensure rigorous and comparable performance benchmarking.
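The four metrics can be implemented directly from their standard definitions; note that sMAPE has several variants in the literature, and the mean-denominator form used below is an assumption rather than the formulation given in Table 3.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the target's units (hours here)."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large deviations more than MAE."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def smape(y_true, y_pred):
    """Symmetric MAPE in percent, using the mean-denominator variant."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

def r2(y_true, y_pred):
    """Coefficient of determination: share of target variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```

A constant-value predictor illustrates the behaviors discussed in Section 4: it can achieve a deceptively low sMAPE while producing an R² at or below zero.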
4. Experiments and Results
To evaluate the capability of AIS-based models for long-distance ETA prediction, this study selects the shipping corridor from Port Weipa (Australia) to Qingdao Port (China). This 3500-nautical-mile route traverses the Coral Sea, the Indonesian Archipelago, and the South China Sea, areas characterized by monsoon variability, complex navigational conditions, and high vessel density. These characteristics make the route an ideal testbed for assessing the performance of a data-driven stacking ensemble framework. More importantly, it serves as a representative case for long-distance tramp bulk shipping, encapsulating its core attributes: market-driven irregular schedules, absence of fixed service frequency, and operations across diverse oceanic regions. Consequently, a model developed and validated on this route is inherently designed to address the fundamental challenges of the domain.
4.1. Dataset
The dataset is sourced from the AIS database of the Shipping Big Data Platform at Jimei University and includes bulk carrier voyages between Weipa and Qingdao from 17 December 2022 to 31 December 2023. A total of 86 complete voyages are retained after preprocessing. The vessels operating on this route are mainly Capesize and Panamax bulk carriers, whose principal dimensions, matched via MMSI codes against the Clarksons World Fleet Register, range from 180 to 292 m in length and 28 to 45 m in breadth. The spatial distribution of these trajectories is illustrated in
Figure 8.
Following the preprocessing and feature engineering procedures outlined in
Section 3, model inputs are represented as a set of d-dimensional feature vectors X = {x_1, x_2, …, x_N}, where each vector x_i encodes the vessel’s navigational state at a specific timestamp based on the six feature groups described in
Section 3.2. The corresponding prediction targets are defined as Y = {y_1, y_2, …, y_N}, where each y_i represents the remaining sailing time, defined as the duration required for the vessel to reach Qingdao from its current location.
To support rigorous performance evaluation, the dataset is randomly partitioned into a training set and a test set using a 4:1 split ratio. The training set is used to fit and optimize the base learners and the meta-learner, while the test set is reserved exclusively for out-of-sample validation. Importantly, all AIS records belonging to the same voyage are assigned entirely to either the training or test set to eliminate data leakage and ensure that evaluation results reflect realistic prediction conditions for unseen voyages.
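The voyage-level partition can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps all records sharing a group (voyage) identifier on the same side of the split. The synthetic voyage IDs and feature matrix below are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(1)
n_records = 860
voyage_ids = rng.integers(0, 86, size=n_records)  # 86 voyages, as in the study
X = rng.normal(size=(n_records, 4))
y = rng.normal(size=n_records)

# 4:1 split at the voyage level, not the record level.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=voyage_ids))
```

Splitting at the record level instead would leak near-duplicate AIS points from the same voyage into both sets and inflate test performance.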
4.2. Comparative Analysis
4.2.1. Performance Comparison Across Evaluation Criteria
The predictive performance of the stacking ensemble is benchmarked against a baseline model and its three constituent base learners. Historical Averaging (Baseline Model), a non-learning approach, establishes a simple benchmark for ETA prediction [58]. For each vessel (identified by MMSI), the model computes the mean of the maximum remaining sailing times across all its voyages in the training set. This vessel-specific historical mean is then used as the ETA prediction for all the vessel’s voyages in the test set, without accounting for real-time navigational dynamics. LightGBM, XGBoost, and RF, which function as the base learners within the stacking ensemble, are also evaluated as standalone machine learning models. This comparison quantifies the performance improvement achieved through the integration of the meta-learner, in contrast to the use of any individual base learner.
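The historical-averaging baseline reduces to a per-vessel groupby: the maximum remaining sailing time per training voyage, averaged per MMSI. The column names and toy values below are hypothetical.

```python
import pandas as pd

# Toy training records: (vessel, voyage, remaining sailing time in hours).
train = pd.DataFrame({
    "mmsi":        [111, 111, 111, 111, 222, 222],
    "voyage_id":   [1, 1, 2, 2, 3, 3],
    "remaining_h": [300.0, 150.0, 280.0, 120.0, 310.0, 200.0],
})

# Maximum remaining time per voyage, then mean across each vessel's voyages.
per_voyage_max = train.groupby(["mmsi", "voyage_id"])["remaining_h"].max()
baseline = per_voyage_max.groupby("mmsi").mean()

def predict_eta(mmsi):
    """Constant per-vessel prediction, ignoring real-time dynamics."""
    return baseline.loc[mmsi]
```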
Figure 9 presents the performance of all models across four complementary evaluation metrics: MAE (with standard deviation), RMSE, sMAPE, and R². To highlight the superiority of the proposed stacking model, its performance on each metric is indicated by a red dashed line in the respective subplot, against which all other models are compared. The error bars on the MAE represent the standard deviation, emphasizing the stability of predictions, which is a key focus of our evaluation. These visual comparisons are further supported by the numerical results reported in
Table 4.
Figure 9 and
Table 4 collectively demonstrate that the stacking ensemble model consistently outperforms all benchmark and individual models across all dimensions of evaluation, including accuracy, stability, goodness-of-fit, and large-error control.
In terms of accuracy, the stacking model achieves an MAE of 3.30 h, representing a 74.7% improvement relative to the historical averaging benchmark (13.02 h) and outperforming all individual models (XGBoost: 3.72 h; LightGBM: 4.12 h; RF: 4.42 h). Although the historical averaging method yields a slightly lower sMAPE (4.44%) than the stacking model (4.52%), this advantage is misleading, as its constant-value predictions artificially suppress relative error. Among learning-based models, the stacking ensemble achieves the best sMAPE, while XGBoost exhibits the highest relative error (6.71%).
Regarding stability, the stacking model records the lowest MAE_std (4.80 h), indicating the most consistent predictive behavior. RF, by contrast, displays substantial volatility (MAE_std = 8.60 h), even exceeding its own MAE, reflecting limited robustness.
For goodness-of-fit, the stacking model achieves an R² of 0.9902, explaining 99.02% of the variance and approaching the theoretical upper limit. Both LightGBM and XGBoost also perform strongly (R² > 0.98). The historical averaging method, however, yields a negative R² (−0.0593), indicating performance worse than a simple mean-based predictor.
In terms of large-error control, the stacking model also obtains the lowest RMSE (4.54 h), while the historical averaging method produces the highest RMSE (17.26 h), reflecting frequent and substantial errors. Among the individual learners, XGBoost performs best, achieving a favorable RMSE-to-MAE ratio, an advantage further strengthened within the stacking ensemble.
Across all evaluation dimensions, the stacking ensemble delivers optimal or near-optimal performance, validating the effectiveness of heterogeneous learners through a meta-learner and demonstrating its reliability for long-distance vessel ETA prediction.
4.2.2. Comparative Analysis of Prediction Accuracy in Spatial Distribution
Figure 10 presents scatter plots comparing predicted and true ETA values for each model, offering a spatial perspective on prediction accuracy and error dispersion. The historical averaging method is excluded due to its degenerate constant-value output. The reference line y = x indicates ideal predictions, while the ±20% and ±50% error bands provide intuitive accuracy thresholds.
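The error-band accounting behind these plots amounts to the share of predictions whose relative error stays within a given band; a minimal sketch, with illustrative values:

```python
import numpy as np

def band_coverage(y_true, y_pred, band):
    """Fraction of predictions with |relative error| <= band (e.g. 0.2)."""
    rel_err = np.abs(y_pred - y_true) / np.abs(y_true)
    return np.mean(rel_err <= band)

# Toy remaining-time values (hours); relative errors: 0.10, 0.25, 0.60, 0.033.
y_true = np.array([100.0, 200.0, 50.0, 300.0])
y_pred = np.array([110.0, 150.0, 80.0, 290.0])
```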
Key observations include: (i) LightGBM and XGBoost show similar performance, with most predictions concentrated within the ±20% band. Slight dispersion appears when the remaining time exceeds 200 h (
Figure 10a,b); (ii) RF exhibits markedly higher vertical dispersion, particularly between 100 and 200 h, with several predictions falling outside the ±50% band (
Figure 10c), reflecting weaker stability; (iii) The stacking ensemble demonstrates the tightest clustering around the ideal line, with uniformly distributed residuals, fewer outliers, and strong alignment across the full prediction spectrum (
Figure 10d).
These spatial distribution results confirm that the stacking model provides the most stable and reliable ETA predictions, thereby supporting its practical deployment for voyage monitoring, route planning, and navigational decision-making.
4.2.3. Comparative Analysis of Residual Error Structures
Residual plots in
Figure 11, mapping predicted values (x-axis) against residuals (true minus predicted values, y-axis), further characterize each model’s error behavior. These diagnostics help identify systematic bias, heteroscedasticity, and variability patterns.
The following observations emerge: (i) All models display residuals centered around zero, with no obvious structural patterns, satisfying basic regression assumptions. (ii) LightGBM displays generally uniform residuals, but with slight overestimation bias (mean = 1.12 h) and moderate variability (std = 5.51 h). Occasional large errors (>±20 h) occur for long-range predictions (
Figure 11a). (iii) XGBoost exhibits slightly lower variability (std = 5.00 h) and more concentrated residuals, although errors increase at higher predicted remaining times (
Figure 11b). (iv) RF displays the highest variability (std = 6.84 h) and occasional extreme negative errors (below −40 h), indicating instability and heteroscedasticity (
Figure 11c). (v) The stacking ensemble provides the most consistent residual distribution, with minimal bias (mean = 0.92 h) and the lowest variability (std = 4.44 h). Residuals cluster tightly within ±5 h across all prediction horizons (
Figure 11d).
In summary, these findings confirm that the stacking ensemble significantly enhances error stability and reduces residual variance, reinforcing its suitability for long-distance ETA prediction under complex navigational conditions.
4.2.4. Feature Importance Analysis and Model Interpretability
To gain insight into model decision mechanisms, feature importance scores derived from the trained LightGBM model are analyzed. The aggregated feature-category importance values, presented in
Figure 12, exhibit a coherent and physically interpretable structure.
The three most influential categories are velocity-related features (22.20%), course-related features (21.27%), and baseline features (20.05%). Jointly, they constitute the primary predictive pillars of the model. These findings reaffirm established maritime reasoning, whereby ETA is governed mainly by current vessel speed, course stability, and remaining distance.
Historical behavioral features (19.87%) contribute meaningfully by capturing short-term fluctuations in speed and heading, which offer contextual insights into the navigational rhythm. However, in long-distance tramp shipping, their predictive utility is lower than that of real-time SOG and distance. This is due to the significant variability of tramp shipping routes and port calls, which diminishes the consistency of vessel-specific patterns across voyages. Over long distances, macro-scale progress (captured by real-time features) becomes the dominant factor in ETA prediction. Nonetheless, historical features still provide valuable information for near-term ETA adjustments, consistent with real-world maritime operations where factors such as current motion and recent vessel behavior are continuously assessed by ship captains.
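The category-level importances in Figure 12 follow from summing per-feature importances within each feature group and normalizing to percentages. In this sketch, `RandomForestRegressor` stands in for the trained LightGBM model, and the feature names and group assignments are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["sog", "sog_mean_6h", "cog", "cog_std_6h",
                 "dist_to_port", "draught"]
groups = {"velocity": ["sog", "sog_mean_6h"],
          "course":   ["cog", "cog_std_6h"],
          "baseline": ["dist_to_port", "draught"]}

X = rng.normal(size=(400, len(feature_names)))
y = 2.0 * X[:, 0] + X[:, 4] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
imp = dict(zip(feature_names, model.feature_importances_))

# Sum per-feature importances within each group, normalize to percent.
total = sum(model.feature_importances_)
category_pct = {
    g: 100.0 * sum(imp[f] for f in feats) / total
    for g, feats in groups.items()
}
```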
Operationally, the reduction in ETA error from 13.02 h (baseline) to 3.30 h (stacking) significantly narrows uncertainty for voyages exceeding 3500 nautical miles. Such improvements directly enhance berth allocation, equipment planning, manpower scheduling, and inland logistics coordination—ultimately strengthening the resilience and efficiency of maritime supply chains.
4.3. Interpretation of the Meta-Learner Weighting Mechanism
To further elucidate the internal decision-making process of the stacking ensemble, this section analyzes the optimal weighting coefficients estimated by the meta-learner.
Table 5 reports the contribution of each base model to the final ensemble prediction. Several key insights emerge from this analysis. First, the stacking ensemble does not perform a simple average of the base learners. Instead, it constructs an optimized linear combination that capitalizes on the complementary predictive behaviors of the constituent models. Notably, as shown in
Table 5, all estimated weighting coefficients are statistically significant (p < 0.001), which confirms that the meta-learner’s allocation of a dominant positive weight to LightGBM, a corrective negative weight to XGBoost, and a supplementary positive weight to RF is robust and not due to random chance. Second, although XGBoost yields the best individual performance (MAE = 3.72 h, R² = 0.9807), the meta-learner assigns the highest positive weight to LightGBM (coefficient = 0.984) and a negative weight to XGBoost (coefficient = −0.187). This allocation indicates that the meta-learner does not merely select the strongest individual model; rather, it prioritizes the configuration that yields the best error synergy and generalization performance.
A plausible explanation is that LightGBM, while slightly less accurate than XGBoost in isolation, produces more stable and consistent residual patterns, making it a robust predictive baseline. In contrast, systematic deviations present in XGBoost’s predictions exhibit structured relationships with LightGBM’s residuals. The meta-learner exploits this relationship by treating XGBoost as an error-correction signal. Assigning a negative coefficient enables the ensemble to reverse the direction of these deviations and fine-tune LightGBM’s predictions. Meanwhile, RF (coefficient = 0.202) contributes positively by supplying additional complementary information that correlates with the target variable.
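This negative-weight error-correction mechanism can be reproduced on synthetic data: when one base model's errors are an amplified version of another's, least squares assigns the noisier model a negative coefficient, and the in-sample squared error of the combination falls below that of either model alone. The data-generating setup below is a toy assumption, not the study's actual predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y = rng.normal(loc=200.0, scale=30.0, size=500)  # "true remaining hours"
noise = rng.normal(scale=3.0, size=500)

pred_a = y + noise                                         # stable baseline
pred_b = y + 2.5 * noise + rng.normal(scale=1.0, size=500) # amplified errors

# Linear meta-learner over the two base predictions.
Z = np.column_stack([pred_a, pred_b])
meta = LinearRegression().fit(Z, y)
combined = meta.predict(Z)

def mse(p):
    return np.mean((y - p) ** 2)
```

Because pred_b's deviations track pred_a's noise, the fitted coefficients approximate (5/3, −2/3): the negative weight on pred_b cancels most of the shared noise component.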
In conclusion, the meta-learner’s weighting pattern demonstrates a key advantage of stacking ensembles: their ability to integrate models with diverse error structures to produce superior predictions. The coordinated interplay among the base models—led by LightGBM’s stability, enhanced by XGBoost’s corrective deviations, and supported by RF’s complementary signals—explains the ensemble’s heightened generalization capability. This asymmetric and at times counterintuitive weighting scheme highlights the sophistication of the stacking strategy and its effectiveness in achieving high-precision ETA prediction for long-distance tramp vessels.
5. Discussion
This study proposes a data-driven ETA prediction framework for tramp shipping by integrating historical AIS trajectory data with a tree-based stacking ensemble model. The empirical validation on the Weipa–Qingdao iron-ore route demonstrates significant improvements in ETA prediction, reducing the Mean Absolute Error (MAE) by 74.7% compared to the historical averaging benchmark. While the framework achieves high predictive accuracy and stability across various evaluation metrics (e.g., RMSE, sMAPE, R²), the following limitations must be addressed in future research.
Firstly, although the model performs well overall, it struggles with accurately predicting ETA during the final port-approach phase. Specifically, the model’s relative error during this phase (sMAPE = 29.08%) indicates difficulty in capturing critical port-access processes, such as pilotage delays, queuing, and weather-induced traffic restrictions, using AIS data alone. This limitation highlights the need to integrate additional data sources—such as meteorological conditions, real-time fairway status, and port congestion indicators—to improve near-port prediction accuracy.
Secondly, this study focuses on a single long-distance route (Weipa-Qingdao), which serves as a representative case for long-distance tramp bulk shipping. However, the generalizability of the model to other routes, vessel types, and geographical regions remains untested. Future work should evaluate the model’s performance across a broader range of routes with varying operational characteristics. This would involve adapting the feature engineering schema to capture route-specific geography while maintaining the flexibility of the framework to handle diverse operational conditions.
Furthermore, the model’s ability to generalize across multi-port voyage structures and different vessel types needs to be explored. For example, ships with different operational profiles (e.g., container ships versus bulk carriers) may exhibit distinct navigational behaviors that could influence ETA prediction. Future studies could assess how well the model adapts to these variations and fine-tune the feature extraction process accordingly.
Another key limitation concerns the computational efficiency of the model in real-time applications. Although the study demonstrates the framework’s promising performance, operational deployment would require addressing engineering challenges such as model lightweighting and improving computational efficiency. This includes optimizing the model for integration with Vessel Traffic Services (VTS), port community systems, and commercial fleet-management platforms, ensuring scalability and responsiveness in dynamic operational environments.
Finally, there is a need for further refinement of the interpretability of the model. While the meta-learner’s weighting strategy has been analyzed, deeper exploration of how specific feature combinations influence ETA predictions could provide valuable insights into the operational decision-making process. Future research could explore methods for improving the transparency of the model and how these insights can be translated into actionable strategies for port and fleet operators.
6. Conclusions
By integrating historical AIS trajectory data with a tree-based stacking ensemble model, this study proposes a data-driven ETA prediction framework for tramp shipping. Using the Weipa–Qingdao iron-ore route as a representative case, the model reduces the MAE from 13.02 h (historical averaging benchmark) to 3.30 h—an accuracy improvement of approximately 74.7%. Across all evaluation metrics, including RMSE, sMAPE, and R², the stacking ensemble consistently achieves optimal or near-optimal performance, demonstrating high predictive accuracy, strong stability, and robust generalization under realistic tramp-shipping conditions.
Beyond performance improvement, the study provides interpretable insights into the determinants of ETA. Feature importance analysis reveals that velocity-related, course-related, and baseline features collectively form the core predictive foundation, accounting for more than 63% of total importance. This ranking is fully aligned with maritime navigation principles, wherein ETA is primarily governed by a vessel’s speed, course stability, and remaining distance. Historical behavioral features also show substantial influence, underscoring the value of recent navigational trends as contextual indicators of short-term motion dynamics. These findings confirm that AIS trajectories alone contain rich, physically meaningful information sufficient to support high-precision ETA estimation in tramp operations.
The interpretability analysis of the meta-learner further illustrates the sophistication of the stacking mechanism. Although XGBoost exhibits the best standalone accuracy, the meta-learner assigns LightGBM the dominant positive weight (0.984) and XGBoost a negative weight (−0.187). This asymmetric allocation indicates that the ensemble optimizes complementary error structures rather than relying on the numerically strongest single model. LightGBM provides a stable predictive baseline, while systematic deviations in XGBoost outputs act as effective error-correction signals when negatively weighted. RF contributes additional positively correlated patterns. Together, these interactions explain the ensemble’s enhanced predictive power and demonstrate the methodological value of exploiting heterogeneous residual structures.
From an operational perspective, reducing ETA uncertainty from roughly half a day to approximately three hours offers substantial practical benefits. More accurate arrival forecasts enable ports to optimize berth planning, equipment utilization, and labor scheduling, while inland logistics networks gain more reliable lead time for coordinating downstream transport and inventory operations. The proposed method, therefore, enhances predictability and operational resilience across the maritime supply chain, particularly for bulk trades characterized by long distances, variable routing conditions, and limited schedule regularity.
Overall, this study demonstrates that high-quality ETA prediction for tramp shipping can be achieved through deep extraction of AIS-derived navigational signals combined with a carefully designed stacking ensemble. The findings contribute both methodological advances and practical insights that support more intelligent, data-driven maritime logistics management.