1. Introduction
To maintain maritime safety and improve navigation efficiency, China has established Vessel Traffic Services (VTS) centers to monitor vessels in target areas in real time, maintain situational awareness of maritime traffic, and support effective decision-making. As a key technology for hazard early warning, real-time vessel monitoring, and traffic flow estimation, vessel trajectory prediction is of great significance for advancing the intelligence of VTS management systems [1, 2]. However, vessel trajectory prediction still faces several challenges: first, observed historical vessel trajectories exhibit discontinuities and non-uniform sampling; second, it is difficult to establish accurate vessel kinematic equations; third, existing methods mainly focus on short-term prediction. Therefore, predicting the medium- and long-term trajectories of target vessels remains a significant challenge, and the academic community has been committed to mining historical vessel navigation data to improve trajectory prediction capabilities [3, 4].
Existing trajectory prediction methodologies are classified into two categories: probabilistic statistical methods and deep learning methods.
Regarding probabilistic statistical methods, scholars have adopted Kalman filtering [5, 6], grey prediction [7, 8], support vector machines (SVMs) [9, 10, 11], and other approaches for trajectory prediction. Wang et al. [5] used Kalman filtering and improved Kalman filtering methods to predict vessel trajectories. However, these methods assume a single motion mode and cannot simultaneously account for both motion trends and disturbances. Some scholars [7, 8] adopted improved grey prediction models for real-time calculation of vessel positions. Xue et al. [12] combined grey theory and fuzzy theory to propose a priority-ranking method for maritime traffic-safety influencing factors, which is used to guide vessel motion prediction. Other studies [9, 10, 11] used trajectory similarity to build SVM-based trajectory prediction models and regression predictors for AIS trajectory prediction. Markov chain-based models [13, 14] predict medium- and long-term vessel trajectories using state transition matrices for position, speed, and course. These probabilistic statistical methods mainly suffer from assuming a single motion mode for the predicted target, which makes them difficult to adapt to complex motion modes.
With the continuous development of deep neural networks, many scholars have turned to neural network methods such as back propagation (BP), convolutional neural networks (CNNs), long short-term memory (LSTM), gated recurrent units (GRUs), and their variant structures [15, 16, 17, 18] to mine the patterns in historical vessel trajectories and perform vessel motion analysis. Hai et al. [15] used BP neural networks for trajectory prediction, verifying the feasibility and effectiveness of the model; Bogaerts et al. [17] adopted CNN units to extract features from traffic data and used LSTM units to realize short-term traffic flow prediction. Liu et al. [1] proposed a deep learning framework based on LSTM networks, which showed good prediction accuracy and robustness in maritime traffic services; Yang et al. [19] adopted a bidirectional LSTM (Bi-LSTM) model combined with data denoising technology, effectively improving the performance of vessel trajectory prediction; Liu et al. [18] constructed a hybrid CNN and Bi-LSTM prediction model tailored to the characteristics of vessel navigation trajectories, obtaining the optimal input–output mapping relationship through network training; Shin et al. [20] proposed a deep learning framework integrating auxiliary tasks and convolutional networks, improving prediction performance with multi-task learning strategies. In addition, deep reinforcement learning methods have been increasingly applied to trajectory prediction. Pei et al. from the Hong Kong University of Science and Technology [21] explored the feasibility of using the reinforcement learning paradigm to model agent behavior reasoning in autonomous driving scenarios. They modeled the task as a Markov Decision Process (MDP), defined the agent’s behavioral intent as a sequence of decisions in a discrete grid world, and drove trajectory prediction by maximizing rewards, which significantly improved prediction confidence and overall performance.
Based on the above analysis, existing trajectory prediction methods still have limitations. Statistical methods, such as grey prediction and Kalman filtering, offer stable models and strong real-time performance. However, they require high-quality data and suffer from lower predictive accuracy when the target’s motion is complex. Deep learning methods, such as LSTMs, construct neural networks with multiple hidden layers that can fit nonlinear time-series data. However, as the prediction horizon increases, the accuracy of vessel trajectory predictions decreases significantly, and they are often used for short-term prediction.
In response to these limitations, this paper explores the feasibility of applying the reinforcement learning paradigm to vessel trajectory prediction and proposes a CNN-PPO deep reinforcement learning method for medium- and long-term maritime target motion prediction. The proposed method regards vessel position variation as the result of sequential navigation decision-making. By learning a navigation policy from historical AIS trajectories, the model predicts future vessel positions through optimal policy inference rather than direct point-wise trajectory fitting. In this way, the vessel trajectory prediction task is transformed into an optimal navigation strategy-solving problem under an MDP framework.
The main contributions of this work are summarized as follows:
A new MDP-based formulation for maritime trajectory prediction is proposed. The vessel trajectory prediction task is modeled as a sequential decision-making process, which avoids the difficulty of explicitly establishing accurate vessel kinematic equations and provides a policy-learning perspective for medium- and long-term prediction.
A multi-channel spatiotemporal trajectory feature matrix is constructed by rasterizing the navigation area and stacking historical navigation-state feature maps. This representation enables the model to capture spatial motion patterns and temporal evolution characteristics from historical AIS trajectories.
An end-to-end CNN-PPO framework with an Actor–Critic architecture is developed. The CNN module extracts trajectory features, while PPO learns an optimal navigation policy by maximizing cumulative long-term rewards, thereby improving trajectory consistency in continuous prediction.
Comprehensive experiments are conducted on a public NOAA AIS dataset. In addition to traditional statistical and conventional neural-network baselines, representative deep learning models, including a GCN and Transformer, are added for comparative analysis. The results demonstrate that the proposed method achieves superior one-hour prediction accuracy and more stable 1–4 h continuous prediction performance.
The computational efficiency, parameter sensitivity, model limitations, and future applicability are further discussed, providing a more complete evaluation of the proposed method for intelligent Vessel Traffic Services (VTS) and maritime target surveillance applications.
2. Mathematical Model of Trajectory Prediction Problem Based on MDP
Existing vessel trajectory prediction methods primarily rely on fitting and extrapolating historical vessel trajectories to predict the target vessel’s future trajectory, without considering the target vessel’s navigation rules from an optimal decision-making perspective. Therefore, this paper aims to establish a novel vessel trajectory prediction method that treats the target vessel’s position changes as the result of its optimal decision-making. By learning a navigation strategy from the target vessel’s historical trajectory, the method predicts the vessel’s future trajectory using that strategy.
As described, the vessel trajectory prediction process can be characterized as a sequential decision-making process, as illustrated in
Figure 1. At each decision moment, the process consists of three steps: acquiring the current vessel trajectory information, deciding on vessel position changes, and updating the system state information. By formulating the vessel trajectory prediction problem as a Markov Decision Process (MDP), the navigation strategy learning and the mapping from the agent’s states to decision outcomes during prediction can be regarded as the decision-making policy of the MDP. Consequently, the vessel trajectory prediction problem is transformed into an optimal policy-solving problem.
The MDP is uniformly expressed as a tuple:
$$\mathcal{M} = \langle S, A, P, R, \gamma \rangle$$
where $S$ is the state space, $A$ is the action space, $P$ is the state transition probability, $R$ is the immediate reward, and $\gamma$ is the discount factor. All mathematical symbols in this paper follow this unified definition.
Based on the Markov Decision Process, the trajectory prediction problem is described as follows:
Prediction starts from an initial time step. At each subsequent time step $i$, the agent sequentially receives a decision command corresponding to the vessel’s position change, with a fixed interval of 1 h between consecutive steps. Let $L$ denote the total number of historical trajectory points; the end time of the sequential decision-making process is then determined by $L$.
(2) State
To fully represent the navigation state and explicitly capture spatiotemporal dependencies, a multi-channel trajectory feature matrix is adopted to describe the system state.
The extraction of the multi-channel trajectory feature matrix is inspired by the design of AlphaGo [22], the Go-playing artificial intelligence program developed by DeepMind. AlphaGo extracts feature maps from the board states over several consecutive moves and stacks them in move order. By analogy, the maritime navigation area is regarded as the Go board, the maritime target as a stone, and the target’s sequence of navigation positions as the sequence of moves. Accordingly, feature maps are extracted from the target’s position states over several steps.
The core idea is to represent the nautical chart in raster format: the raster map size is determined by the potential activity range of the target vessel, and the target’s navigation state is mapped onto it. Taking the position of the target vessel at the previous moment as the center of the raster map, the coordinate of the target on the raster map at the current moment is calculated from the relative position change between the two moments. The feature value at this position is set to 1, and all other feature values are set to 0. The resulting raster map is the navigation-state feature map, so one feature map is obtained for each pair of adjacent moments, as shown in Figure 2a. The navigation-state feature maps computed for consecutive trajectory segments are then stacked in temporal order to generate the trajectory feature matrix, as illustrated in Figure 2b.
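The rasterization described above can be illustrated with a minimal Python sketch. The grid half-width `half_cells` and the cell size `cell_deg` are hypothetical parameters chosen only for illustration, since the raster resolution is not stated in the text; clamping out-of-range positions to the grid border is likewise an assumption of this sketch.

```python
def navigation_state_map(prev, curr, half_cells=5, cell_deg=0.1):
    """Build a one-hot raster centered on the previous position.

    prev, curr: (lon, lat) in degrees at two adjacent moments.
    half_cells, cell_deg: hypothetical grid half-width and cell size.
    """
    size = 2 * half_cells + 1
    grid = [[0] * size for _ in range(size)]
    # Relative position change, expressed in whole grid cells.
    col = half_cells + round((curr[0] - prev[0]) / cell_deg)
    row = half_cells + round((curr[1] - prev[1]) / cell_deg)
    # Assumption: positions outside the raster are clamped to its border.
    col = min(max(col, 0), size - 1)
    row = min(max(row, 0), size - 1)
    grid[row][col] = 1
    return grid


def trajectory_feature_matrix(points, **grid_kwargs):
    """Stack one feature map per pair of adjacent points, in temporal order."""
    return [navigation_state_map(p, c, **grid_kwargs)
            for p, c in zip(points, points[1:])]
```

A trajectory with k points therefore yields k − 1 channels, each containing exactly one non-zero cell.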
(3) Action
At each decision-making step, the agent is required to output the vessel’s position change based on the current environmental state. Therefore, the action is defined as the longitude and latitude variation from the current step to the next:
$$a_i = (\Delta \mathrm{Lon}_i, \Delta \mathrm{Lat}_i)$$
The vessel’s trajectory is predicted hourly. Assuming the maximum navigation speed of the vessel is 30 knots (1 knot = 1.852 km/h), the maximum navigation distance of the target vessel in 1 h is 30 × 1.852 = 55.56 km. Thus, the action is constrained so that the displacement corresponding to $a_i$ does not exceed 55.56 km.
(4) Immediate-Reward Function
The goal of this paper is to enable the agent to learn the target vessel’s navigation strategy and apply it to the trajectory prediction scenario. This requires that, during the learning process, the agent’s navigation trajectory closely follow the target vessel’s actual trajectory to obtain the maximum decision reward.
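This paper directly observes the target vessel’s trajectory and computes the position error at time $i$, thereby learning the underlying navigation strategy from the trajectory data. Suppose that at time $i$ the agent generates an action $a_i = (\Delta \mathrm{Lon}_i, \Delta \mathrm{Lat}_i)$. The agent’s coordinates are then updated as follows:
$$\mathrm{Lon}_{i+1} = \mathrm{Lon}_i + \Delta \mathrm{Lon}_i, \qquad \mathrm{Lat}_{i+1} = \mathrm{Lat}_i + \Delta \mathrm{Lat}_i$$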
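In calculating the position error, the longitude and latitude error between the target vessel’s position and the agent’s position is converted into a distance error. The actual distance $d_i$ between the predicted value $(\mathrm{Lon}_i, \mathrm{Lat}_i)$ and the true value $(\mathrm{Lon}_i^{*}, \mathrm{Lat}_i^{*})$ is calculated using the Haversine formula:
$$d_i = 2 R_E \arcsin\sqrt{\sin^2\!\left(\frac{\mathrm{Lat}_i^{*} - \mathrm{Lat}_i}{2}\right) + \cos(\mathrm{Lat}_i)\cos(\mathrm{Lat}_i^{*})\sin^2\!\left(\frac{\mathrm{Lon}_i^{*} - \mathrm{Lon}_i}{2}\right)}$$
where $(\mathrm{Lon}_i^{*}, \mathrm{Lat}_i^{*})$ is the position information of the target vessel at time $i$ (angles in radians), and $R_E$ is the Earth’s mean radius, taken as 6371.004 km.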
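As a worked illustration of the distance computation and of the 55.56 km hourly action limit, the following Python sketch may help. The function names are hypothetical, and applying the action as a simple coordinate offset is an assumption about how the position update is carried out.

```python
import math

EARTH_RADIUS_KM = 6371.004   # Earth's mean radius used in the text
MAX_STEP_KM = 30 * 1.852     # 30 knots sustained for 1 h = 55.56 km


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # min() guards against rounding pushing h slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def apply_action(position, action):
    """Assumed position update: add (dLon, dLat) to the current (lon, lat)."""
    return (position[0] + action[0], position[1] + action[1])


def within_speed_limit(position, action):
    """Check that the displacement implied by an action stays within 55.56 km."""
    new_lon, new_lat = apply_action(position, action)
    return haversine_km(position[0], position[1], new_lon, new_lat) <= MAX_STEP_KM
```

For example, one degree of latitude corresponds to roughly 111.2 km, so an hourly latitude change of 0.6° would violate the limit while 0.4° would not.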
Based on the above analysis, the immediate-reward function $r_i$ is defined as shown in Equation (5). When the distance error $d_i$ between the agent and the vessel is less than a certain threshold $\delta$, the agent receives a reward of +100; when the distance error is greater than $\delta$, the agent receives a penalty of −100.
$$r_i = \begin{cases} +100, & d_i < \delta \\ -100, & d_i > \delta \end{cases} \quad (5)$$
The threshold δ is introduced according to the practical requirement of maritime target surveillance. In real monitoring systems, δ can be set as the effective detection radius of the sensor. If the actual vessel position falls within the detection range centered at the predicted position, the prediction is considered effective for surveillance and a positive reward is assigned. Otherwise, the prediction is regarded as ineffective and a penalty is applied. In the experimental setting, since the public AIS dataset is not associated with a specific sensor configuration, δ is empirically determined using the training set before PPO training. Specifically, the distance error of the CNN-based supervised learning model on the training set is used as the reference threshold, because the CNN baseline adopts the same trajectory feature representation as the proposed CNN-PPO model. This setting encourages the reinforcement learning agent to learn a navigation policy that improves upon the supervised CNN baseline.
Although the binary reward is consistent with the surveillance-oriented success criterion, it provides limited gradient-like feedback regarding the magnitude of the prediction error. Therefore, it may restrict learning efficiency and generalization under complex maneuvering conditions. Future work will consider continuous distance-aware reward shaping, such as exponential decay or piecewise reward functions, to provide more informative feedback while maintaining the application-driven interpretation of the detection threshold.
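The binary reward, and one possible form of the distance-aware alternative mentioned as future work, can be sketched as follows. The boundary case where the error equals δ is not specified in the text, so this sketch treats it as a penalty; `shaped_reward` and its decay scale are purely hypothetical and are not part of the proposed method.

```python
import math


def immediate_reward(distance_error_km, delta_km):
    """Binary reward: +100 inside the threshold, -100 otherwise.

    Assumption: an error exactly equal to delta_km is penalized.
    """
    return 100.0 if distance_error_km < delta_km else -100.0


def shaped_reward(distance_error_km, delta_km):
    """Hypothetical exponential-decay reward with delta_km as the decay scale.

    Shown only to illustrate what continuous feedback could look like.
    """
    return 100.0 * math.exp(-distance_error_km / delta_km)
```

Unlike the binary form, the shaped variant distinguishes an error of 2 km from one of 8 km even when both lie on the same side of the threshold.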
(5) Discount Factor
The agent’s goal is to find the policy that maximizes the cumulative reward, so the state-value function is introduced as a core evaluation metric, which is defined in Equation (6).
$$V^{\pi}(s) = \sum_{a} \pi(a \mid s)\left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s')\right] \quad (6)$$
where
$V^{\pi}(s)$ denotes the expected cumulative reward starting from state $s$ when following policy $\pi$.
$\pi(a \mid s)$ denotes the probability of taking action $a$ given the current state $s$ under policy $\pi$.
$R(s, a)$ denotes the immediate reward obtained after taking action $a$ in state $s$.
$V^{\pi}(s')$ denotes the expected cumulative reward of the next state $s'$ under policy $\pi$.
$\gamma$ is the discount factor, which controls the weight of future rewards in the current state’s value estimation.
$P(s' \mid s, a)$ denotes the state transition probability, i.e., the probability of transitioning to the next state $s'$ after executing action $a$ in state $s$, which is determined by the vessel’s navigation kinematics.
If the target trajectory prediction problem could be described by an accurate mathematical model, methods from optimal control theory could be introduced to iteratively solve the state-value function. However, in practical maritime scenarios, the complexity of vessel motion and environmental interactions makes it difficult to explicitly model and obtain the state transition probability. Therefore, this paper adopts a model-free reinforcement learning approach to learn the optimal policy directly from historical trajectory data without requiring an explicit transition model.
In summary, the objective of this paper is to find an optimal navigation policy that maximizes the state-value function. That is, the agent outputs the optimal action based on the current state, which serves as a prediction of the target vessel’s trajectory, and updates its state for the next moment. Because accurate dynamic modeling is difficult, an end-to-end reinforcement learning method is employed to learn the state-value function for the vessel trajectory prediction problem, thereby learning the navigation strategy of the target vessel and applying the learned policy to trajectory prediction.
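In a model-free setting, the state value is estimated from sampled reward sequences rather than from a known transition model. The quantity being estimated is the discounted return, sketched below in Python; the reward sequence and discount factor in the usage note are illustrative values, not settings reported in the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**k, accumulated from the last step back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With rewards of +100, +100, −100 and a discount factor of 0.9, the return is 100 + 0.9 × 100 + 0.81 × (−100) = 109, which shows how the discount factor reduces the influence of a later miss on the value of the current state.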
3. Trajectory Prediction Method Based on Deep Reinforcement Learning
This section presents a vessel trajectory prediction method that learns an optimal navigation strategy using a deep reinforcement learning model. First, the trajectory prediction process and the policy network’s structure are described. Subsequently, the method for training the policy network using deep reinforcement learning is introduced.
3.1. Construction of the Trajectory Prediction Policy Network
In this paper, the solution to the trajectory prediction problem is decomposed into two phases: vessel navigation strategy learning and vessel trajectory prediction, as illustrated in
Figure 3. In the vessel navigation strategy learning phase, the trajectory feature matrix is first fed into the agent, which then outputs the vessel’s position change for the next time step and updates the vessel’s trajectory. The agent then computes the error between the predicted and target positions, calculates a reward, and updates the policy based on that reward. Finally, the agent determines whether the terminal state has been reached, thereby terminating the current training task. In the vessel trajectory prediction phase, the agent receives the vessel’s current navigation state as input. The agent then computes the optimal action (i.e., the vessel’s position change) based on the learned vessel navigation strategy, which serves as the predicted vessel trajectory.
When the agent performs vessel trajectory prediction, it must make optimal decisions about the vessel’s position changes based on the trajectory feature matrix. In classical reinforcement learning problems, a value table (e.g., a Q-table) can typically be used to represent a small, finite set of states. However, the vessel trajectory prediction problem involves an enormous state space with potentially continuous state values, which a table cannot enumerate, whereas neural networks handle such continuous, high-dimensional data well. Furthermore, when the target vessel executes complex maneuvers, the state-to-action mapping is nonlinear, and neural networks can fit such nonlinear relationships effectively. For these reasons, a neural network is employed as the approximator for the policy and value functions.
This paper proposes a CNN-PPO deep reinforcement learning method that uses a network architecture comprising multi-layer convolutional neural networks and fully connected layers to approximate the policy and value functions. The specific network structure is shown in
Figure 4.
The proposed CNN-PPO method adopts an Actor–Critic architecture [23], consisting of two main components: the Actor network and the Critic network.
The Actor network is a policy-generation network that parameterizes the decision-making policy $\pi_{\theta}$. Its input is the target’s navigation-state feature maps, and its output is the decision policy, specifically the position changes in longitude (ΔLon) and latitude (ΔLat). The network structure includes convolutional layers (C1, C2) and a fully connected layer (F1); the output position change is then used to update the navigation state.
The Critic network is a policy evaluation network that models the expected cumulative reward function. It outputs the state value to evaluate the Actor network’s policy output, and the Actor network then adjusts its decision policy based on this evaluation, enabling iterative optimization of the entire model.
3.2. Training of the Trajectory Prediction Policy Network
This paper adopts the Proximal Policy Optimization (PPO) algorithm [24] to train the policy network. In the trajectory prediction problem, the agent makes sequential decisions, and each decision is recorded as a single step. In model-free reinforcement learning, policy update methods [25] can be divided into Monte Carlo methods and temporal difference (TD) methods [26]. This paper primarily uses the TD method to update and train the prediction policy network. The training process is illustrated in Figure 5 and detailed as follows:
At each decision step, the network predicts a vessel trajectory and receives an immediate reward. During training, the TD method uses a one-step prediction to compute the state-value function, which is then used to update the network parameters. The Actor network is parameterized by $\theta$, and its parameters are updated using Equation (7), in which the terms have the following meanings.
$\pi_{\theta}$ represents the agent’s current policy and is used at each update iteration.
$\pi_{\theta_{\mathrm{old}}}$ is a policy backed up at fixed time intervals. Computing the update relative to this backed-up policy makes policy updates more stable.
$A(s, a)$ denotes the advantage function, which measures how much better the action $a$ is compared to the average expected return in state $s$. It serves as a baseline that reduces the variance in policy updates, so that actions that outperform the average are reinforced.
The remaining term is a regularization term that ensures the sampling distribution does not deviate significantly from the original distribution.
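To make the roles of the current policy, the backed-up policy, and the advantage more concrete, the sketch below implements the clipped surrogate objective from the original PPO formulation. This is an assumption: the text describes a regularization term without giving its form, and PPO also has a KL-penalty variant, so the clipped version and the `clip_eps` value shown here may differ from what the authors used.

```python
import math


def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and the backed-up policy.
    advantages: advantage estimates for the same actions.
    clip_eps: hypothetical clipping range (0.2 is a common default).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio between policies
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to move far from the old policy.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the two policies coincide, the ratio is 1 and the objective reduces to the mean advantage; when the ratio grows beyond 1 + `clip_eps` for a positive advantage, the gain is capped.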
To update the Critic network, a supervised learning approach is adopted to minimize the estimation error of the state-value function. The parameters of the Critic network are updated using Equation (8). The quantities in Equation (8) are the parameters of the Critic network, the length of the trajectory sequence, the time step that follows $i$, the immediate reward at the corresponding time step, and the output of the Critic network, which is the estimated expected cumulative reward at time step $i$.
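A one-step TD critic loss consistent with the description above might look like the following Python sketch. The exact form of the Critic objective is not recoverable from the text, so the mean squared one-step TD error used here is an assumption, as are the function and argument names.

```python
def critic_td_loss(values, rewards, next_values, gamma):
    """Mean squared one-step TD error over a trajectory.

    values[i]:      Critic estimate for the state at step i
    rewards[i]:     immediate reward received at step i
    next_values[i]: Critic estimate for the following state (0 at the end)
    Returns (loss, td_errors); the TD errors can also serve as simple
    advantage estimates for the Actor update.
    """
    td_errors = [r + gamma * nv - v
                 for v, r, nv in zip(values, rewards, next_values)]
    loss = sum(e * e for e in td_errors) / len(td_errors)
    return loss, td_errors
```

For instance, with values [1, 2], rewards [1, 0], next-state values [2, 0], and a discount factor of 0.5, the TD errors are 1 and −2 and the loss is 2.5.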
The training process of the trajectory prediction model optimized for cumulative reward is as follows:
(1) Based on temporal order, the navigation-state feature maps are extracted from time step 1 onward.
(2) At time step $i$, the navigation-state feature maps are input into the Actor network, which outputs the decision policy, specifically the predicted changes in longitude and latitude for the next time step.
(3) The Critic network estimates the expected cumulative reward function of the state and outputs the state value.
(4) The parameters of the Critic network are updated using Equation (8).
(5) The Actor network adjusts its parameters based on the state value output from the Critic network, using Equation (7) for the policy update.
4. Experiments
4.1. Experimental Description
4.1.1. Experimental Scenario
This paper proposes a deep reinforcement learning-based method for predicting a target vessel’s trajectory at multiple future time steps. In this section, experimental simulations are conducted using real trajectory data to verify the effectiveness of the proposed algorithm, and the prediction results are analyzed.
The scenario for position prediction is as follows: given the target vessel’s trajectory information at the first m hours, the vessel’s positions from m + 1 to m + 4 h are predicted. To verify the effectiveness of the proposed algorithm, preprocessing of the trajectory data is required, including trajectory data cleaning, segmentation to generate sequence samples, and data interpolation for uniform sampling.
All experiments are carried out on a workstation equipped with an AMD Ryzen 7 7800X3D CPU (8 cores, 16 threads), 32 GB of memory, and an NVIDIA RTX 4090 GPU. The software environment is configured with Python 3.11.10 and the deep learning framework TensorFlow 2.11.0. Data processing relies on NumPy 1.21.6 and pandas 1.3.5. The proposed CNN-PPO model is implemented on top of a locally modified PPO implementation derived from OpenAI Baselines (baselines v0.1.5).
4.1.2. Experimental Dataset and Preprocessing
In this study, the publicly available 2024 AIS dataset released by the National Oceanic and Atmospheric Administration (NOAA) is adopted to validate the performance of the proposed method. We select trajectory data covering the waters adjacent to China and the Western Pacific Ocean, with a geographical scope ranging from 0° (Equator) to 50° N in latitude and 100° E to 180° E in longitude. Only cargo ships with vessel type codes between 70 and 79 are retained for subsequent experiments.
The raw AIS data is processed in four sequential steps: data cleaning, voyage segmentation, time-series interpolation, and uniform resampling.
First, invalid and abnormal records are removed, including entries with missing longitude or latitude information, vessel speeds exceeding 30 knots, instantaneous course changes greater than 90° within 10 s, and spatial outliers that violate physical navigation constraints.
Second, voyage segmentation is conducted using a time interval threshold of one hour. Consecutive trajectories with a time gap longer than this threshold are divided into independent voyages. Short and low-quality trajectory fragments are discarded, and each qualified voyage is assigned a unique voyage ID.
Third, given the non-uniform sampling intervals inherent in raw AIS data, dense interpolation is implemented at a 1 s interval. Linear interpolation is applied to numerical attributes including longitude, latitude, speed, and course, while forward filling is used for categorical fields such as MMSI and vessel type. The interpolated data is further resampled at a fixed 1 min interval and aligned to integer timestamps.
Finally, only voyages with a continuous duration of no less than 12 h are retained. The filtered dataset comprises 154 vessels, 260 valid voyages, and approximately 321,000 geographic position points. After uniform resampling at a 1 h interval, a sliding window scheme is established, taking the historical trajectory of the previous 10 h as model input and the subsequent 1 h trajectory as the prediction target.
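The segmentation and resampling steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: timestamps are seconds in strictly increasing order, each track has at least two points, and angular wrap-around (course near 0°/360°, longitude near ±180°) is ignored. The function names are hypothetical.

```python
def split_voyages(timestamps, max_gap_s=3600):
    """Group point indices into voyages, splitting where the gap exceeds 1 h."""
    voyages, current = [], []
    for idx, t in enumerate(timestamps):
        if current and t - timestamps[current[-1]] > max_gap_s:
            voyages.append(current)
            current = []
        current.append(idx)
    if current:
        voyages.append(current)
    return voyages


def resample_linear(times, values, step):
    """Linearly interpolate (times, values) onto a uniform grid of spacing `step`."""
    out, j, t = [], 0, times[0]
    while t <= times[-1]:
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while j + 2 < len(times) and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append((t, values[j] + w * (values[j + 1] - values[j])))
        t += step
    return out
```

Calling `resample_linear` with `step=1`, then keeping every 60th and later every 3600th sample, would mirror the 1 s, 1 min, and 1 h stages described in the text.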
The 260 valid voyages are partitioned at an 8:2 ratio at the voyage level, with 208 voyages assigned to the training set and 52 voyages to the test set. Sliding window sampling is then performed on the partitioned voyage data, yielding 6424 training samples and 1636 test samples. All baseline models and the proposed CNN-PPO method adopt the same test set to ensure fair and consistent comparative evaluation.
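The sliding-window sampling and the voyage-level split can be illustrated with the short Python sketch below. The split shown simply takes the first 80% of the voyage IDs in their given order; whether the actual partition was randomized is not stated in the text, so that choice is an assumption.

```python
def sliding_windows(track, history=10, horizon=1):
    """Cut an hourly track into (history, target) samples.

    track: list of hourly points for one voyage.
    history: number of past points used as input (10 h in the text).
    horizon: number of future points to predict (1 h in the text).
    """
    samples = []
    for start in range(len(track) - history - horizon + 1):
        past = track[start:start + history]
        future = track[start + history:start + history + horizon]
        samples.append((past, future))
    return samples


def split_by_voyage(voyage_ids, train_tenths=8):
    """Split at the voyage level so no voyage contributes to both sets."""
    n_train = len(voyage_ids) * train_tenths // 10
    return voyage_ids[:n_train], voyage_ids[n_train:]
```

Splitting before windowing matters: windows from the same voyage overlap heavily, so a sample-level split would leak test trajectories into training. With 260 voyages the sketch yields 208 training and 52 test voyages, matching the counts reported above.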
4.2. Experimental Results
4.2.1. Training Convergence Analysis
To evaluate the training stability of the proposed CNN-PPO framework, the average episodic reward obtained during reinforcement learning training is adopted as the convergence indicator.
Figure 6 illustrates the evolution of the average reward over the training episodes, where the horizontal axis denotes the training episode index and the vertical axis represents the average reward within each episode. As shown in
Figure 6, the average reward increases rapidly during the early training stage and gradually converges as the number of training episodes increases. This trend indicates that the proposed reinforcement learning framework can continuously optimize the vessel navigation policy through interaction with historical trajectory data. After sufficient iterations, the reward distribution becomes stable, suggesting that the CNN-PPO framework can learn an effective vessel motion prediction policy from historical navigation trajectories. The smooth convergence also indicates that PPO-based policy optimization improves training stability and alleviates the oscillation problem commonly observed in conventional policy-gradient methods.
4.2.2. Comparative Experiments
To evaluate the single-step trajectory prediction performance of the proposed method, comparative experiments are conducted against representative trajectory prediction approaches, including fitting extrapolation, grey prediction, BP neural network, LSTM, CNN, GCN, and Transformer-based models. To ensure fair comparison, all models are trained and evaluated using the same trajectory preprocessing procedure, identical training/test partitions, and consistent historical observation windows. The fitting extrapolation method adopts polynomial fitting, while the grey prediction model first performs spline interpolation to obtain uniformly sampled sequences before prediction. The BP and CNN models use the trajectory feature matrix proposed in this paper as input. The LSTM model directly models temporal dependencies in sequential position data. The GCN model employs two layers of symmetric-normalized graph convolution on a temporal chain graph with last-node readout, whereas the Transformer model uses a two-layer encoder with multi-head self-attention and last-token readout. The prediction error is characterized by the root mean square error (RMSE). The quantitative results are summarized in
Table 1.
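For reference, the RMSE used as the error measure can be computed as in the following minimal Python sketch. Applying it separately to latitude errors, longitude errors, and per-sample distance errors is an assumption about how the three reported columns are obtained.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

Because errors are squared before averaging, a few large misses on sharply maneuvering vessels weigh more heavily than many small ones.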
As shown in Table 1, the proposed CNN-PPO method achieves the best overall prediction performance among all compared methods. Specifically, the proposed model obtains the smallest latitude prediction error (0.0260°), longitude prediction error (0.0298°), and distance prediction error (4.29 km). Compared with the CNN model, the distance prediction error is reduced from 4.65 km to 4.29 km (about 7.7%). Compared with the GCN model, the error is reduced from 4.45 km to 4.29 km (about 3.6%), indicating that the proposed reinforcement learning mechanism provides additional improvement beyond conventional feature extraction or sequence modeling.
Traditional statistical methods exhibit limited adaptability in this task. Although fitting extrapolation and grey prediction have relatively simple modeling processes, they rely heavily on stable motion assumptions and high-quality, uniformly sampled data. In particular, the grey prediction method has strict requirements for sequence regularity, and many raw trajectory fragments cannot fully satisfy these requirements. Therefore, its prediction performance deteriorates under complex vessel maneuvering conditions.
Deep learning-based models generally outperform traditional statistical methods because they are better able to capture nonlinear relationships in vessel trajectory data. However, purely supervised models mainly minimize one-step prediction errors and do not explicitly optimize sequential decision-making consistency. The relatively weaker performance of the Transformer model in this experiment may be related to autoregressive error accumulation, limited training samples, and insufficient physical or navigational constraints. Similarly, the GCN model mainly captures local temporal dependencies along the constructed trajectory graph, but local aggregation alone may be insufficient to maintain global trajectory consistency over long prediction horizons. In contrast, the proposed CNN-PPO framework formulates vessel trajectory prediction as an MDP-based sequential decision-making problem. By optimizing cumulative rewards, the model learns a navigation policy rather than merely fitting trajectory points, which improves robustness and long-term consistency.
4.2.3. Multi-Step Prediction Performance
To verify the effectiveness of the proposed navigation strategy learning method for predicting vessel positions at multiple future time steps, the CNN-PPO algorithm is compared with other neural network algorithms under an autoregressive prediction setting. Let the current time step be denoted as m. When predicting the vessel position at time step m + n, all models iteratively generate the sequence of intermediate predictions from m + 1 to m + n. At each step, the predicted position is fed back into the input window for the next prediction.
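The autoregressive setting described above can be illustrated with a minimal sketch. The names `autoregressive_rollout` and `constant_velocity` are hypothetical and are not taken from the paper; the constant-velocity predictor is only a stand-in for any trained one-step model, and a fixed-length sliding window is assumed.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Position = Tuple[float, float]  # (latitude, longitude) in degrees


def autoregressive_rollout(
    predict_next: Callable[[Sequence[Position]], Position],
    history: Sequence[Position],
    n_steps: int,
) -> List[Position]:
    """Predict positions m+1 .. m+n, feeding each prediction back into the window."""
    # Assumption: the input window keeps a fixed length equal to the history length.
    window: Deque[Position] = deque(history, maxlen=len(history))
    predictions: List[Position] = []
    for _ in range(n_steps):
        nxt = predict_next(list(window))
        predictions.append(nxt)
        # The oldest entry is dropped and the prediction becomes the newest input.
        window.append(nxt)
    return predictions


def constant_velocity(window: Sequence[Position]) -> Position:
    """Hypothetical stand-in predictor: repeat the last observed displacement."""
    (lat0, lon0), (lat1, lon1) = window[-2], window[-1]
    return (2 * lat1 - lat0, 2 * lon1 - lon0)


if __name__ == "__main__":
    observed = [(30.0, 120.0), (30.5, 121.0)]
    print(autoregressive_rollout(constant_velocity, observed, n_steps=3))
```

Because every later input contains earlier predictions, any one-step error is carried into all subsequent steps, which is the accumulation effect examined in this experiment.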
Figure 7 presents the prediction error of latitude, longitude, and geodesic distance under different prediction horizons.
As shown in Figure 7, the prediction errors of all compared models increase as the prediction horizon becomes longer. This is mainly caused by autoregressive error accumulation and the increasing uncertainty of vessel maneuvering behavior over time. Nevertheless, the proposed CNN-PPO model consistently maintains lower errors than the competing models across different horizons. In particular, when the prediction horizon reaches 4 h, the proposed method still achieves a distance prediction error of 11.29 km. This result confirms that the CNN-PPO framework has stronger long-horizon prediction capability. The advantage arises from its cumulative reward optimization mechanism: unlike conventional supervised learning models that focus only on local prediction accuracy, CNN-PPO explicitly optimizes the sequential decision policy, thereby improving global trajectory consistency during recursive prediction.
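For reference, a distance error in kilometers can be computed from latitude and longitude pairs as in the following sketch. It assumes a spherical Earth and the haversine great-circle formula; the exact geodesic formula used in the experiments is not stated in this section, so `haversine_km` and `distance_rmse_km` are illustrative names only.

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)


def haversine_km(p: Position, q: Position) -> float:
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_rmse_km(actual: Sequence[Position], predicted: Sequence[Position]) -> float:
    """Root-mean-square of the per-sample great-circle errors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [haversine_km(a, p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    truth = [(30.0, 122.0), (30.1, 122.1)]
    guess = [(30.0, 122.05), (30.12, 122.1)]
    print(round(distance_rmse_km(truth, guess), 3))
```

An ellipsoidal geodesic (for example, on WGS-84) would give slightly different values, but the difference is small relative to kilometer-scale prediction errors.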
4.2.4. Computational Efficiency Analysis
To evaluate the practical applicability of the proposed method in real-time maritime surveillance scenarios, the training time and online inference latency of neural network-based trajectory prediction models are compared. The experimental results are shown in Figure 8. The horizontal axis represents different prediction models, including BP neural network, LSTM, CNN, and CNN-PPO, whereas the vertical axis denotes the time required for model training and real-time inference.
The results show that the proposed CNN-PPO framework requires a longer offline training time than conventional supervised learning models because reinforcement learning involves iterative policy evaluation and policy improvement. However, after training is completed, the model only needs a forward inference process during online prediction, and its inference latency remains within an acceptable range for real-time maritime traffic monitoring. This characteristic is suitable for practical VTS applications, in which historical AIS data can be used for offline model training, while online deployment only requires rapid trajectory prediction based on the latest observations. Therefore, the additional offline training cost does not prevent the proposed method from being applied to real-time maritime surveillance.
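Online inference latency of the kind compared here can be measured with a simple timing harness such as the sketch below. The function name `measure_latency_ms`, the warm-up count, and the number of repeats are assumptions for illustration; the measurement protocol used in the experiments is not described in this section.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(
    infer: Callable[[], object], warmup: int = 5, repeats: int = 50
) -> Dict[str, float]:
    """Time repeated calls to a forward-inference callable and summarize in milliseconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the statistics
        infer()
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


if __name__ == "__main__":
    # Placeholder workload standing in for one forward pass of a trained model.
    print(measure_latency_ms(lambda: sum(i * i for i in range(10_000))))
```

Reporting a median and an upper percentile, rather than a single run, gives a more reliable picture of whether latency stays within a real-time budget.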
4.2.5. Ablation Study on Convolution Kernel Size
To investigate the influence of convolution kernel configurations on prediction performance and to examine whether the selected configuration is appropriate, comparative experiments are conducted using different kernel sizes. The RMSE results of longitude, latitude, and distance prediction under different kernel configurations are shown in Figure 9. The results indicate that the proposed CNN-PPO framework maintains relatively stable prediction performance under different convolution kernel settings, demonstrating the robustness of the network architecture. Among the evaluated configurations, using a 3 × 3 convolution kernel in the first convolutional layer and a 2 × 2 convolution kernel in the second convolutional layer achieves the best overall prediction accuracy. This configuration provides a reasonable balance between local spatial feature extraction and model complexity; therefore, it is adopted in the subsequent experiments.
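The effect of the kernel choice on feature-map size and receptive field can be checked with standard convolution arithmetic, as in the sketch below. The stride of 1, the absence of padding, and the 8 × 8 input size are assumptions made only for illustration; they are not reported in this section.

```python
from typing import Sequence, Tuple


def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def receptive_field(layers: Sequence[Tuple[int, int]]) -> int:
    """Receptive field of stacked convolutions given (kernel, stride) per layer."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


if __name__ == "__main__":
    side = 8  # hypothetical feature-matrix side length
    after_first = conv_output_size(side, kernel=3)          # 3 x 3 kernel
    after_second = conv_output_size(after_first, kernel=2)  # 2 x 2 kernel
    print(after_first, after_second, receptive_field([(3, 1), (2, 1)]))
```

Under these assumptions, the 3 × 3 followed by 2 × 2 configuration lets each output unit see a 4 × 4 neighborhood of the input, while the small kernels keep the parameter count low.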
4.2.6. Visualization of Trajectory Prediction Results
Figure 10a–d show representative trajectory prediction results of the CNN-PPO method for prediction horizons of 1 h, 2 h, 3 h, and 4 h. In these figures, the blue curves represent the actual trajectories, the red curves represent the predicted trajectories, the horizontal axis denotes longitude, and the vertical axis denotes latitude. The visualization results show that the proposed method can capture both approximately linear motion and nonlinear maneuvering patterns. Although prediction deviations gradually increase with longer horizons, the predicted trajectories remain consistent with the overall evolution trend of the actual vessel motion. This further confirms that the proposed CNN-PPO framework can learn vessel navigation strategies from historical trajectory data and maintain stable recursive prediction performance in medium- and long-term forecasting tasks.
5. Discussion
The experimental results demonstrate that the proposed CNN-PPO framework can effectively improve maritime target trajectory prediction, especially for medium- and long-term forecasting. The comparative results in Table 1 show that CNN-PPO achieves the lowest one-hour prediction errors among all compared methods. This indicates that combining convolutional feature extraction with reinforcement learning-based policy optimization is effective for learning vessel motion patterns from historical AIS trajectories.
The training convergence shown in Figure 6 further supports the feasibility of the proposed reinforcement learning formulation. The average reward increases rapidly in the early training stage and then converges to a stable level, suggesting that the agent gradually learns a stable navigation policy from historical trajectory samples. However, reward convergence alone should not be interpreted as complete proof of optimal prediction performance. Therefore, the comparative experiments and multi-step prediction results are necessary to jointly validate the effectiveness of the learned policy.
A key finding from the multi-step prediction experiment is that the proposed CNN-PPO model maintains better long-horizon stability than conventional supervised learning models. In autoregressive prediction, the output at one step is repeatedly used as part of the input for subsequent prediction. As a result, small errors may propagate and accumulate over time. This explains why the RMSE values of all models increase when the forecasting horizon extends from 1 h to 4 h. Nevertheless, CNN-PPO exhibits slower error growth and achieves an error of 11.29 km at the 4 h horizon, indicating its stronger ability to preserve global trajectory consistency.
The performance improvement of CNN-PPO can be explained from three perspectives. First, the CNN module transforms historical trajectories into multi-channel trajectory feature maps and extracts local spatial motion patterns from rasterized trajectory representations. This enables the model to capture short-term motion tendencies and local maneuvering features more effectively than purely vector-based time-series models. Second, the MDP formulation converts vessel trajectory prediction from a point-wise regression task into a sequential decision-making problem. Instead of directly fitting future positions, the model learns a navigation policy that maps the current trajectory state to the next position change. Third, PPO optimizes cumulative rewards over the sequential prediction process, which encourages the model to consider long-term trajectory consistency rather than only minimizing one-step prediction errors.
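The third point can be made concrete with a small sketch of the two quantities involved: the discounted cumulative return and the standard PPO clipped surrogate objective. The discount factor, the clipping range, and the toy inputs below are assumptions for illustration and are not the settings used in the experiments.

```python
import math
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of one rollout."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def ppo_clipped_objective(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over the samples."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # probability ratio between new and old policy
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Toy rewards, e.g. negative position errors along a predicted trajectory.
    print(discounted_returns([-0.5, -0.8, -1.2], gamma=0.9))
    print(ppo_clipped_objective([-1.0, -0.7], [-1.1, -0.9], [0.4, -0.2]))
```

Because each return aggregates the rewards of all later steps, a policy that maximizes it is penalized for early decisions that cause large errors later, which a one-step regression loss does not capture.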
Compared with recent state-of-the-art deep learning models such as GCNs and Transformers, the proposed CNN-PPO framework shows more stable performance in medium- and long-term trajectory prediction. GCN-based models are effective in capturing local graph-structured dependencies, while Transformer-based models are powerful for modeling temporal correlations through self-attention. However, both methods are essentially supervised learning frameworks and mainly optimize prediction errors at individual time steps. In multi-step autoregressive forecasting, such local optimization may lead to cumulative error propagation. In contrast, CNN-PPO formulates vessel trajectory prediction as a sequential decision-making problem and optimizes the cumulative reward over the entire prediction process. This enables the model to preserve global trajectory consistency more effectively, which is consistent with its superior performance in the 1–4 h continuous prediction experiments.
The performance of CNN-PPO is also related to different trajectory conditions. For nearly straight or smoothly varying trajectories, most supervised learning models can achieve acceptable short-term prediction accuracy because the vessel motion pattern is relatively stable. However, when vessels exhibit nonlinear maneuvers, such as turning, speed variation, or course adjustment, purely supervised models tend to suffer from larger accumulated errors because they mainly optimize point-wise prediction accuracy. In contrast, CNN-PPO learns a sequential navigation policy by maximizing cumulative rewards, which helps preserve the global trajectory evolution trend. This makes the proposed method more suitable for medium- and long-term prediction under moderately nonlinear vessel motion conditions.
However, the proposed method still has several limitations. First, under highly abrupt maneuvering conditions, such as sudden course changes or emergency avoidance behaviors, the learned policy may not fully capture unexpected navigation decisions, leading to increased prediction uncertainty. Second, under sparse or irregular AIS observations, the constructed trajectory feature matrix may lose important motion information, which can weaken the model’s ability to infer accurate future positions. Third, the current framework mainly focuses on single-vessel trajectory prediction and does not explicitly model vessel–vessel interactions. In dense traffic scenarios, collision avoidance, encounter situations, and navigation-rule constraints may significantly influence vessel motion. Fourth, the current model does not incorporate environmental factors such as wind, waves, ocean currents, visibility, or traffic separation schemes, which may affect prediction robustness in complex maritime environments.
From the perspective of computational efficiency, the proposed CNN-PPO method introduces higher offline training cost than conventional statistical methods because reinforcement learning requires iterative policy evaluation and policy improvement. Nevertheless, once the model is trained, online prediction only requires a forward inference process. Therefore, the model can satisfy real-time inference requirements in maritime traffic monitoring applications. This characteristic is suitable for practical VTS systems, where historical AIS data can be used for offline training and real-time trajectory prediction can be performed rapidly during online deployment.
Overall, the proposed CNN-PPO framework provides a new perspective for maritime trajectory prediction by integrating convolutional spatiotemporal feature extraction with reinforcement learning-based navigation policy optimization. The experimental results confirm that the model can improve both one-step prediction accuracy and multi-step prediction stability. Future research will further incorporate environmental information, vessel–vessel interaction modeling, and navigation-rule constraints to enhance the robustness and practical applicability of the proposed method under more complex maritime traffic conditions.
6. Conclusions
This paper focuses on the medium- and long-term prediction of vessel trajectories. From the perspective of optimal navigation strategy learning, a trajectory prediction model based on the MDP is established, transforming the trajectory prediction problem into an optimal policy-solving problem. Specifically, a multi-channel trajectory feature matrix is constructed to extract the spatial correlation patterns of vessel trajectories, while the MDP framework is used to model the temporal evolution of vessel motion as a sequential decision-making process. On this basis, a CNN-PPO deep reinforcement learning method is introduced to solve for the policy that maximizes the long-term cumulative reward, enabling continuous multi-step prediction of medium- and long-term maritime target motion. Experimental results on the public AIS dataset show that the proposed method not only predicts vessel positions accurately at a specific future moment, but also maintains better trajectory consistency in multi-step prediction. Compared with traditional statistical methods and representative neural network models, CNN-PPO achieves lower prediction errors, and its advantage becomes more pronounced as the prediction horizon increases. In terms of computational efficiency, CNN-PPO requires a longer offline training time than standard statistical methods due to the iterative policy optimization process. However, once trained, the model only requires forward inference for online prediction, and its real-time inference latency remains acceptable for maritime traffic monitoring applications.
7. Limitations of the Study
In this paper, the proposed single-agent CNN-PPO prediction model focuses on the independent navigation strategy learning of the target vessel and still has certain limitations in practical applications. First, the model does not consider real-world environmental factors such as weather conditions (wind, waves, currents), maritime traffic rules (COLREGs), and multi-vessel interactions. Second, as a single-agent framework, it faces challenges in scaling to dense maritime traffic environments and collision-avoidance maneuvers. These shortcomings restrict the practical applicability of the method in real operational scenarios.
8. Future Recommendations
To improve the practicality and scalability of the model, future research will address the following aspects. First, meteorological data (wind speed, wave height, current velocity) will be incorporated into the multi-channel trajectory feature matrix to enhance the model's environmental awareness and improve prediction robustness under complex sea conditions. Second, standard navigation rules and encounter constraints will be introduced into the reward function of the reinforcement learning model, enabling the agent to learn decision-making behaviors that conform to real-world navigation regulations. Third, surrounding vessel information and collision-avoidance constraints will be added to adapt the model to dense traffic scenarios and realistic maneuvering decisions. Incorporating these real-world factors and multi-vessel interaction information is expected to improve the prediction accuracy, robustness, and practical applicability of the model.