The clipped loss function can be defined as follows. To prevent drastic changes in the policy, the probability ratio $r_t(\theta)$ is constrained not to deviate beyond the range $[1-\epsilon,\ 1+\epsilon]$:
$$ L_t^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t \right) \right] $$
where $L_t^{\mathrm{CLIP}}(\theta)$ is the clipped loss function, $\hat{A}_t$ is the advantage function evaluated at time $t$, $\epsilon$ is the clipping threshold, and $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ restricts $r_t(\theta)$ to lie within the interval $[1-\epsilon,\ 1+\epsilon]$.
The overall loss function is composed of three components. First, the policy loss serves as the core objective in PPO, incorporating a clipping mechanism to ensure stability during policy updates. Second, the value loss enhances the accuracy of the state-value function by penalizing deviations from the predicted returns. Lastly, the entropy bonus promotes sufficient exploration by encouraging policy stochasticity and preserving action diversity. The total loss function employed in PPO is defined as follows:
$$ L_t^{\mathrm{PPO}}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 S[\pi_\theta](s_t) \right] $$
where $L_t^{\mathrm{CLIP}}(\theta)$ is the clipped policy objective function, $L_t^{\mathrm{VF}}(\theta)$ is the value estimation loss, $S[\pi_\theta](s_t)$ is the entropy-based exploration bonus, and $c_1$ and $c_2$ are hyperparameters that weight the contributions of the value function loss and the entropy regularization term in the overall objective.
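For concreteness, a minimal PyTorch-style sketch of this combined objective is given below; the tensor names (ratio, advantages, values, returns, entropy) and the default coefficient values are illustrative assumptions rather than the implementation used in this work.

    import torch

    def ppo_total_loss(ratio, advantages, values, returns, entropy,
                       clip_eps=0.2, c1=0.5, c2=0.01):
        """Clipped surrogate + value loss + entropy bonus (illustrative sketch)."""
        # Clipped policy objective: constrain the probability ratio to [1 - eps, 1 + eps]
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()

        # Value loss: mean squared error between predicted state values and returns
        value_loss = torch.nn.functional.mse_loss(values, returns)

        # Entropy bonus: keeps the policy stochastic to encourage exploration
        entropy_bonus = entropy.mean()

        # Total loss minimized by the optimizer
        return policy_loss + c1 * value_loss - c2 * entropy_bonus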
Fuel Consumption Minimization Model—PPO
A reinforcement learning framework using the PPO algorithm is applied to derive fuel-efficient ship routing strategies. The state space is a six-dimensional vector comprising cumulative distance, vessel speed, wind speed, current speed, wave height, and wave period. The action space is discrete: each action combines a position and a speed, with speed discretized on a uniform grid between 10 and 19 knots. At each decision step, the agent observes the current state and selects a position and speed combination.
In position selection, the only valid positions are those within two grid points to the left or right (a lateral window) of the previously selected point. The reward function is defined as the negative of the fuel consumption predicted by a transformer-based fuel consumption prediction model, so that the agent learns routing policies that lead to lower fuel consumption.
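A small Python sketch of this lateral-window constraint is given below; the window width of two grid points follows the description above, but the function name and grid indexing are assumptions.

    import numpy as np

    def valid_lateral_positions(last_index: int, num_lateral: int, window: int = 2):
        """Indices of grid positions reachable from the last selected point.

        Only positions within +/- `window` grid points (here, two) of the
        previously chosen lateral index are considered valid actions.
        """
        low = max(0, last_index - window)
        high = min(num_lateral - 1, last_index + window)
        return np.arange(low, high + 1)

    # Example: with 7 lateral grid points and the last choice at index 5,
    # the agent may move to indices 3, 4, 5, 6.
    print(valid_lateral_positions(last_index=5, num_lateral=7))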
The ship fuel consumption minimization problem via waypoint optimization can be mathematically formulated as a Markov Decision Process (MDP), enabling the application of reinforcement learning algorithms to derive optimal routing policies. The MDP is defined by a set of states $s_t$, actions $a_t$, rewards $r_t$, a state transition function $P(s_{t+1} \mid s_t, a_t)$, and a termination condition $T$. Based on this formulation, the agent learns optimal routing and speed control policies. In the reinforcement learning environment, the state $s_t$ at time step $t$ is composed of six elements, capturing key dynamic and environmental factors affecting ship routing: the distance between the current and previous waypoints ($d_t$), the currently selected vessel speed, i.e., speed over ground (SOG) ($v_t$), wind speed ($w_t$), ocean current speed ($c_t$), wave height ($h_t$), and wave period ($p_t$). The state vector is formally expressed as follows:
$$ s_t = \left[ d_t,\ v_t,\ w_t,\ c_t,\ h_t,\ p_t \right] $$
In reinforcement learning, the state space $S$ varies dynamically depending on the batch size $B$, and the space of observation vectors can be defined as follows:
$$ S \in \mathbb{R}^{B \times 6} $$
where $B$ is the batch size.
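As an illustrative sketch only, a batch of such six-dimensional state vectors could be assembled as follows; the feature values and array names are placeholders.

    import numpy as np

    def make_state(d_t, v_t, w_t, c_t, h_t, p_t):
        """Six-dimensional state: distance, speed, wind, current, wave height, wave period."""
        return np.array([d_t, v_t, w_t, c_t, h_t, p_t], dtype=np.float32)

    # A batch of B observations forms a (B, 6) array, matching the batched state space.
    batch = np.stack([
        make_state(12.4, 14.0, 6.2, 0.8, 1.5, 7.1),
        make_state(11.9, 15.0, 5.7, 0.6, 1.2, 6.8),
    ])
    print(batch.shape)  # (2, 6)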
The agent’s action $a_t$ at time step $t$ is represented as a combination of two components: the index of the selected waypoint $l_t$ and the vessel speed $v_t$. The action is formally expressed as follows:
$$ a_t = \left( l_t,\ v_t \right) $$
The action space $A$ is defined as the Cartesian product of the discrete set of waypoint indices and the discretized set of vessel speeds. Formally, it can be written as follows:
$$ A = \{1,\ 2,\ \ldots,\ L\} \times \{v_{\min},\ \ldots,\ v_{\max}\} $$
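A brief sketch of enumerating this Cartesian product is shown below, assuming a 1 kn speed grid from 10 to 19 kn and a placeholder number of candidate waypoints per step.

    from itertools import product

    NUM_WAYPOINT_CANDIDATES = 5          # assumed number of candidate waypoints per step
    SPEED_GRID_KN = list(range(10, 20))  # assumed uniform 1-knot grid from 10 to 19 knots

    # Discrete action space: every (waypoint index, speed) combination
    ACTIONS = list(product(range(NUM_WAYPOINT_CANDIDATES), SPEED_GRID_KN))

    print(len(ACTIONS))   # 5 candidates x 10 speeds = 50 discrete actions
    print(ACTIONS[0])     # (0, 10)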
At each time step $t$, the agent selects a combination of waypoint index $l_t$ and vessel speed $v_t$, represented as the action $a_t$. The reward function $r_t$ is designed to encourage the agent to minimize fuel consumption. Specifically, the fuel consumption $F_t$ at time $t$ is predicted by a fuel consumption prediction model $f_{\mathrm{fuel}}(x_t)$, where $x_t$ denotes the feature vector at the current time step. The fuel consumption is computed as follows:
$$ F_t = f_{\mathrm{fuel}}(x_t) $$
The reward $r_t$ is then defined as the negative value of the predicted fuel consumption, such that lower fuel consumption results in higher rewards. Formally, the reward is expressed as follows:
$$ r_t = -F_t $$
This reward formulation drives the agent to learn a policy that reduces the overall fuel consumption.
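A minimal sketch of this reward computation is given below; fuel_model stands in for the transformer-based fuel consumption predictor, and its predict interface is an assumption for illustration.

    import numpy as np

    def compute_reward(fuel_model, x_t: np.ndarray) -> float:
        """Reward is the negative of the predicted fuel consumption F_t = f_fuel(x_t)."""
        # fuel_model.predict is an assumed interface returning the per-step fuel estimate
        fuel_t = float(fuel_model.predict(x_t[None, :]))
        return -fuel_t  # lower consumption -> higher reward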
The state transition function $P(s_{t+1} \mid s_t, a_t)$ defines the transition to a new state $s_{t+1}$ resulting from the agent’s action $a_t$ taken at time step $t$, given the current state $s_t$. The state transition is represented as follows:
$$ s_{t+1} = f_{\mathrm{state}}(s_t, a_t) $$
where $f_{\mathrm{state}}$ denotes the state transition function, $s_t$ is the current state, and $a_t$ is the agent’s action.
The state transition process involves computing the distance between the previous and current waypoints, as well as the corresponding fuel consumption, to generate the next observation state $s_{t+1}$.
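One possible realization of this transition, sketched below, computes the great-circle distance to the newly selected waypoint and assembles the next observation; the haversine helper and the weather-lookup interface are illustrative assumptions.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres between two (lat, lon) points."""
        r = 6371.0
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    def transition(prev_wp, next_wp, v_t, weather):
        """f_state: build s_{t+1} from the chosen waypoint, speed, and local conditions."""
        d_next = haversine_km(*prev_wp, *next_wp)
        w, c, h, p = weather  # wind speed, current speed, wave height, wave period at next_wp
        return [d_next, v_t, w, c, h, p]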
The termination condition $E_t$ defines the criteria for ending an episode. The episode terminates when the reinforcement learning agent reaches a predefined maximum number of time steps or the final waypoint. The termination condition is formally expressed as follows:
$$ E_t = \mathbb{1}\left[\, t \geq T_{\max}\ \ \lor\ \ W_t = W_{\mathrm{final}} \,\right] $$
where $T_{\max}$ is the total number of time steps, $W_t$ is the number of waypoints evaluated up to time $t$, and $W_{\mathrm{final}}$ is the target point, i.e., the final waypoint.
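A compact sketch of this termination test, assuming the episode ends at either the step budget or the final waypoint, is:

    def is_terminal(t: int, w_t: int, t_max: int, w_final: int) -> bool:
        """Episode ends when the step budget T_max is exhausted or the final waypoint is reached."""
        return t >= t_max or w_t >= w_final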
By leveraging the defined state space, action space, reward function, state transition dynamics, and termination condition, the waypoint optimization problem is formally modeled as a Markov Decision Process (MDP). This formulation enables the agent to learn a policy that selects optimal waypoints $l_t$ and vessel speeds $v_t$ at each decision step, with the objective of minimizing overall fuel consumption.
The reward function graph for the PPO model designed to minimize fuel consumption converges, as illustrated in Figure 4.
Waypoint Number Optimization Method—PPO
The PPO algorithm optimizes waypoint location and count by learning terrain-aware placement through a depth-based reward system. The state space includes latitude, longitude, and water depth at the current location. The agent performs three actions: selecting a location, adjusting speed, and adding or deleting waypoints. The reward function considers both depth and waypoint characteristics.
Based on the vessel’s maximum draft, shallow water is defined as under 50 m and deep water as over 200 m. Deleting waypoints in deep water yields a positive reward, while doing so in shallow water results in a penalty. To maintain a reasonable waypoint count, strong penalties are applied if the number falls below 3 or exceeds 100. The allowed distance between waypoints ranges from 10 km to 200 km, with penalties beyond this range. This structure leads the agent to prefer sparse waypoint placement in deep waters and denser placement in coastal or shallow areas, mimicking experienced mariners’ navigation strategies. This reduces unnecessary course changes, enhances efficiency, and simplifies routing. The state space is defined as follows:
$$ S_t = \left[ \mathrm{lat}_t,\ \mathrm{lon}_t,\ d_t \right] $$
where $\mathrm{lat}_t$ and $\mathrm{lon}_t$ are the latitude and longitude coordinates observed at time $t$, and $d_t$ is the bathymetric depth at time $t$.
In reinforcement learning, the state space S varies dynamically depending on the batch size B, and the space of observation vectors can be defined as Equation (15).
The agent’s action $A_t$ at time step $t$ is composed of two components, as follows:
$$ A_t = \left( l_t,\ a_t \right) $$
The first component, $l_t$, represents the index of the waypoint selected by the agent at time step $t$. It is chosen from a discrete set of available waypoints, where $l_t \in \{1, 2, \ldots, L\}$. The second component, $a_t$, represents the type of modification applied to the waypoint set. This is a discrete action where $a_t \in \{0, 1, 2\}$, with the following meanings:
$a_t = 0$: maintain the current waypoint set;
$a_t = 1$: add a new waypoint;
$a_t = 2$: delete an existing waypoint.
This action structure allows the agent not only to select navigation waypoints but also to dynamically adjust the waypoint set during training, thereby enabling more flexible route optimization.
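The sketch below illustrates how such an action could be applied to a working list of waypoints; the function, constants, and minimum-size guard are assumptions for illustration.

    MAINTAIN, ADD, DELETE = 0, 1, 2

    def apply_waypoint_action(waypoints, l_t, a_t, new_waypoint=None):
        """Apply (l_t, a_t): keep the set, insert after index l_t, or delete the waypoint at l_t."""
        waypoints = list(waypoints)          # work on a copy
        if a_t == ADD and new_waypoint is not None:
            waypoints.insert(l_t + 1, new_waypoint)
        elif a_t == DELETE and len(waypoints) > 2:
            waypoints.pop(l_t)               # never delete below the start/end pair
        return waypoints                     # a_t == MAINTAIN leaves the set unchanged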
The action space $A$ consists of combinations of possible waypoint indices and discrete operations (maintaining, adding, or deleting a waypoint). It can be defined as follows:
$$ A = \{1,\ 2,\ \ldots,\ L\} \times \{0,\ 1,\ 2\} $$
The reward function $R_t$, which evaluates the agent’s action $A_t$, is constructed based on three key criteria. The first criterion is a depth-based reward: adding waypoints in deeper waters (i.e., open ocean) is rewarded, whereas adding waypoints in shallow waters (i.e., coastal regions) incurs a penalty. The second criterion is a waypoint-count-based reward: a penalty is applied when the number of waypoints exceeds a certain threshold, discouraging excessively dense routing. The third criterion is a distance-based reward: a penalty is imposed if the distance between consecutive waypoints becomes too small, encouraging spatial efficiency in the planned route. The reward function is formally defined as follows:
$$ R_t = r_{\mathrm{depth}} + r_{\mathrm{distance}} + r_{\mathrm{waypoint}} $$
where $r_{\mathrm{depth}}$ is the depth-based reward, $r_{\mathrm{distance}}$ is the distance-based reward between waypoints, and $r_{\mathrm{waypoint}}$ is the reward based on the number of waypoints.
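A hedged sketch combining the three reward terms is given below; the depth, count, and spacing thresholds follow the values stated earlier in this section, while the reward and penalty magnitudes are placeholders.

    DELETE = 2  # action code for deleting a waypoint, as defined above

    def waypoint_reward(depth_m, num_waypoints, spacing_km, action):
        """R_t = r_depth + r_waypoint + r_distance (illustrative magnitudes)."""
        r_depth = 0.0
        if action == DELETE:
            # Deleting in deep water (> 200 m) is encouraged; in shallow water (< 50 m) it is penalized.
            r_depth = 1.0 if depth_m > 200 else (-1.0 if depth_m < 50 else 0.0)

        # Keep the waypoint count within a reasonable range (3 to 100).
        r_waypoint = -5.0 if (num_waypoints < 3 or num_waypoints > 100) else 0.0

        # Keep the spacing between consecutive waypoints within 10-200 km.
        r_distance = -1.0 if (spacing_km < 10 or spacing_km > 200) else 0.0

        return r_depth + r_waypoint + r_distance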
The state transition function $f_{\mathrm{state}}$ maps the current state and action to a new state $S_{t+1}$, representing the environment’s evolution when the agent executes action $A_t$ at time $t$. The state transition function $T$ is defined as follows:
$$ S_{t+1} = f_{\mathrm{state}}(S_t, A_t) $$
In the state transition process, the new observation state is generated based on the current waypoint positions, the addition or deletion of waypoints, and the distances between consecutive waypoints. The termination condition E defines the criteria for ending an episode. An episode terminates either when the reinforcement learning agent reaches the predefined maximum time step or when all waypoints have been visited. The termination condition E is defined as Equation (21).
The episode terminates when the agent reaches the destination point of the route, at which point learning is halted and a new route is initiated for training.
The reward function graph for the PPO-based model designed to optimize the number of waypoints converges, as illustrated in Figure 5.