An Improved Distributed Sampling PPO Algorithm Based on Beta Policy for Continuous Global Path Planning Scheme

Traditional path planning is mainly utilized for path planning in discrete action space, which results in incomplete ship navigation power propulsion strategies during the path search process. Moreover, reinforcement learning experiences low success rates due to its unbalanced sample collection and unreasonable design of reward function. In this paper, an environment framework is designed, which is constructed using the Box2D physics engine and employs a reward function, with the distance between the agent and arrival point as the main, and the potential field superimposed by boundary control, obstacles, and arrival point as the supplement. We also employ the state-of-the-art PPO (Proximal Policy Optimization) algorithm as a baseline for global path planning to address the issue of incomplete ship navigation power propulsion strategy. Additionally, a Beta policy-based distributed sample collection PPO algorithm is proposed to overcome the problem of unbalanced sample collection in path planning by dividing sub-regions to achieve distributed sample collection. The experimental results show the following: (1) The distributed sample collection training policy exhibits stronger robustness in the PPO algorithm; (2) The introduced Beta policy for action sampling results in a higher path planning success rate and reward accumulation than the Gaussian policy at the same training time; (3) When planning a path of the same length, the proposed Beta policy-based distributed sample collection PPO algorithm generates a smoother path than traditional path planning algorithms, such as A*, IDA*, and Dijkstra.


Introduction
Trade transportation is the basis of economic development, and, with the growth in trade volume in various countries, shipping accounts for more than 2/3 of international cargo transportation. Accidents, however, have been continuously posing risks to maritime transportation; according to one survey, more than 80% of accidents are caused by human factors in the process of ship navigation [1]. Path planning, as one of the key technologies for traditional autonomic navigation scheme, aims to find a safe and economic path for ships and has received increasing attention from both industry and academia in recent years [2].
The plan of the path provides the optimal route based on various factors such as distance, time, etc. Considering divergent factors, path planning can be divided into local path planning and global path planning [3]. Local path planning indicates that the changes in the surrounding environment are unknown (mainly reefs and dynamic obstacles), and the path needs to be corrected based on sensor data in real time. Global path planning involves planning a route from the starting point to the endpoint, founded on known environmental information at a large geographic scale (1 km to 100 km level, or more).
This paper presents an in-depth study of global path planning. Traditional path planning algorithms currently used to solve global path planning include the A*, Dijkstra, and artificial potential energy field methods. Most of these algorithms use spatial discretization to represent the environment as equal-sized raster maps and discretize the action space into four-connected or eight-connected [4]. The limitations of these path planning algorithms include the following. First, these algorithms rely on prior knowledge of the system when used for global path planning. Consequently, if there are changes in the environment, such as the emergence of new obstacles, it is necessary to obtain updated and accurate information about the environment in order to re-plan the route. In addition, these algorithms do not take into account the ship's navigational capability, only searching the driving path without giving the bow heading and velocity vector of the nodes in the path. Moreover, these algorithms search the way topologically, leading to much higher computation and lower efficiency required for planning with the increasing complexity in the planning scene.
To address the challenges, as mentioned before, of traditional path planning, Sun proposed a dynamic trajectory generation model for ship rust removal robots based on a bionic neural network and utilized a kinematic state matching search method to achieve dynamic path planning while considering the robot's kinematic model [5]. However, during actual testing, it was observed that the proposed method lacked robustness in the planning process, making it unable to guarantee successful planning. To improve the robustness and stability of the algorithm in the path planning task, Shen et al. from Dalian Maritime University proposed an intelligent collision avoidance navigation method for unmanned ships, which considered navigation experience and collision avoidance rules using Deep Competitive Q-learning (DCQN) with the A* algorithm [6]. Similarly, Wang et al., also from Dalian Maritime University, proposed the DRLOAD algorithm based on the Deep Q-Network (DQN), which achieved a multi-dimensional navigation by designing a reward function specifically for safe obstacle avoidance [7]. Both methods employed value learning methods in reinforcement learning to solve path planning problems in a discrete action space. Nevertheless, these approaches faced issues with the reward function's design and the ship propulsion strategy's incomplete structure during the planning process. Another relevant contribution is the path planning algorithm proposed by Kuczkowski, which is based on the Deep Deterministic Policy Gradient (DDPG) algorithm. This approach integrates several subgoals of path planning, collision avoidance, compliance with the Convention on the International Regulations for Preventing Collisions at Sea (COLREGs), and task completion by designing four reward functions [8]. Kuczkowski successfully achieved path planning and dynamic obstacle avoidance that satisfied COLREGs in complex scenarios. While adequately designing the reward functions and providing each path node's coordinates and velocity vectors during planning, the environment modeling did not consider the ship's kinematic model.
A summary of the above related work is presented in Table 1. Based on the above research work, this paper proposes a distributed sampler PPO algorithm based on Beta policy to address the pain points, i.e., that traditional path planning increases computationally in the case of rising scenario complexity and cannot directly deal with high-dimensional data. Furthermore, it addresses the problems that reinforcement learning algorithms face in solving path planning problems, such as the unreasonable design of the reward function-most algorithms only perform path planning in the discrete action space and primarily use deterministic policy in planning. The main technical contributions of this paper are as follows: 1.
Designing and constructing a simulation environment that conforms to the ship propulsion characteristics, using the Box2D physics engine as the framework; 2.
Constructing a reward function that integrates the mainline reward and auxiliary reward, where the mainline reward is determined based on the distance between the agent and the arrival point, while the auxiliary reward is potential-based, superimposed from boundary control, obstacles, and arrival point; 3.
Proposing a distributed sampling policy on the basis of the stochastic policy algorithm PPO, which significantly enhances sampling performance and enables robust ship path planning in continuous domain spaces;

4.
Introducing the Beta policy as a replacement for the Gaussian policy in the action selection policy. This substitution effectively resolves the issue of action boundary constraints. The advantages of the algorithm proposed in this paper are reflected as follows. By employing the stochastic policy algorithm, this paper achieves better planning results by simulating a ship's propulsion characteristics in a continuous state and action space. The integration of the mainline and auxiliary rewards results in a smooth trajectory, with each planned path node providing position and velocity vectors. Our proposed algorithm attains shorter planning paths than other reinforcement learning algorithms. Moreover, when compared to traditional path planning methods with similar path lengths, our algorithm generates smoother trajectories.

Background
In solving global path planning problems by reinforcement learning, the learning tasks chosen vary for the different discrete and continuous action spaces in global path planning. Reinforcement learning is usually described by a Markov decision process (MDP), which consists of a state-space S and an action space A. In MDP, the individual who performs the learning and decision-making is called the agent, and the range with which the agent is interacting is called the environment. At the discrete moment t, the agent observes the states S t (S t ∈ S) fed by the environment and gives feedback actions A t (A t ∈ A) to all these states, according to the policy π, where the policy can be deterministic or stochastic. The environment provides a reward R t in response to the action and moves to the next state S t+1 ∈ S; the state-action pairs generated by this agent-environment interaction constitute a trajectory: The learning tasks of reinforcement learning (RL) can be divided into value-based and policy-based categories.

Value-Based Methods for Discrete Domain
Traditional RL uses the Q-learning method and stores the transfer state by Q-table when applied to path-planning tasks. To solve the problem that, when the state complex increases, the efficiency of the Q-learning update drops, DQN, as proposed in 2013, adds the deep neural network and experience replay based on Q-learning [9]. Although it improves the policy performance dramatically and can be used to handle high-dimensional state-space tasks, DQN generates an unavoidable overestimation problem in the process of action-value Q estimation, which occurs because DQN performs action-value estimation based on the Bellman Equation (1): In this context, the action-value function Q * (s t , a t ) is defined as the expected value of the immediate reward r t+1 obtained by taking action a t in the current state s t , plus the expected value with discounted maximum action-value Q * (s t+1 , a ; θ t ) of the next state s t+1 , corresponding to the action a chosen by the target policy π with parameters θ t .
Here, θ t represents the target network parameters for the policy π, and γ is the discount factor 0 ≤ γ < 1. The target network always selects the action that maximizes the actionvalue Q * (s t , a t ) according to its own policy θ t , which leads to bootstrap. Silver et al. [10] replaced the bootstrap action generated by the target network with online network policy to update the action value estimation: The online network Q(s t+1 , a; θ t ) is used for action selection, and the corresponding action is used for target calculation; due to a delay in the update of the network target, the actions that correspond to the value q-max of the online network are not necessarily identical to the actions that correspond to the value q-max in the network target. Thus, the problem of overestimating is reduced.

Policy-Based Method for Continuous Domain
This value-based learning method cannot solve the stochastic policy problem well because it is a deterministic policy in essence. The policy gradient-based [11] reinforcement learning algorithm optimizes a policy function by fitting a continuous function representing the current task performance metric and using gradient ascent. In this state, the value function can be used to assist the policy training, without need for action choice. The policy gradient-based approach has more robust convergence than the action-value-based approach; the policy gradient ∇ θ J(π θ ) is expressed as follows: where ρ π (s) is the state distribution, π θ (a | s) denotes current state s according to the policy sampled by the parameter θ, and g q represents the policy gradient estimator with Q π (s, a) as the target: In the actual calculation operation, the policy gradient is estimated by sampling g q in large numbers, so the sample mean of g q converges to its expected value E g q = ∇ θ J(π θ ) when the number of samples is sufficient.
However, the policy gradient based method has the problems of falling into local optima and inefficient policy evaluation. To solve these problems, Sutton et al. proposed Actor-Critic. The Actor refers to the policy function, i.e., the agent action decision part. Critic refers to the value function that can be regarded as an evaluator of the current state. The Critic and Actor parameters are updated by the data obtained from the real-time interaction between the agent and the environment, which makes the algorithm converge.
Schulman et al. proposed the Proximal Policy Optimization (PPO) algorithm [12] by using the Trust Region Policy Optimization (TRPO) [13] surrogate objective; the PPO clips the surrogate objective to a smaller range and replaces the advantage function A π (s, a) in the agent with a Generalized Advantage Estimator (GAE) [14], which is used for highdimensional continuous action training by GAE to obtain stable policy. To ensure continuous exploration, even in states where the policy is relatively mature, PPO incorporates entropy rewards when calculating the surrogate objective. Additionally, the inherent advantage of its stochastic policy in path exploration prevents the algorithm from ending up trapped in local optima during the search process [15].
This paper uses PPO as the baseline algorithm for global path planning implementation and optimization. Firstly, we conducted path planning environment design, and in the environment design process, we needed to consider the state-space and action space structure for deep reinforcement learning. For the algorithm to converge properly, we needed to design the reward function to be highly matched for the state-space of the global path planning task. The rationality of the reward function design directly guides the algorithm's convergence speed and policy performance. The PPO limits the policy update magnitude by clipping the surrogate objective, so that the old and new policies do not produce significant changes in the update and, thus, obtain more stable policy updates.

Interaction Environment Design
Deep reinforcement learning updates the state through constant interaction between the agent and the environment, and optimizes the policy to maximize the reward. This method requires environment design, including state-space, action space, and reward function. The environment design determines the scene structure and interaction logic and adds constraints to each scene element.
The simulation environment is constructed using the Box2D physics engine for structural development, and the Pyglet module is utilized for rendering the plane frame. The size of the plane frame structure is set as n * n, with a render scale of s. Obstacles are generated at the center of the frame, using a generation scale of τ. The obstacle area is divided into a grid space of k * k, with a grid radius of a. The obstacle radius ranges from 0.2a to 1.5a, and the number of obstacles generated falls within the range k, k 2 − k . To prevent overlapping, an offset is applied to the obstacle generation coordinates. This offset follows a uniform distribution U(−n * τ * 0.05, n * τ * 0.05), ensuring that no obstacles overlap with each other. Unlike other path planning simulation scenarios for RL, this simulation environment increases the ship and the endpoint generation range to improve the algorithm's robustness. As shown in Figure 1, the orange-covered area is the hull generation area, and the blue area is the arrival point generation area.

State-Space Design
The state information used for the global path planning task of deep reinforcement learning can be divided into the sensor data sequence and the image from the environment. DRL extracts the relationship between the features and the policy from training. Due to the sensor data sequence's low dimension, it is easy for the agent to acquire the policy according to the shift features; the image contains a higher dimensionality message, so the agent needs to learn the policy from the image extracted feature. As the network processes the state information differently, we designed two state-spaces, described below.

Environmental Feedback Sensor Data Sequence-Based State-Space
To make the agent perceive the feature shifts in the motion, we designed an approach using LIDAR to measure the distance to avoid obstacles, as shown in Figure 2. We set the ship's radius as b; the effective reflection range of LIDAR is within h times the vessel's radius, expressed as [b, h * b]. We set the horizontal field angle of the detection rays to ρ and uniformly arranged the l ray beams within the field. During the ship's motion, if another object truncates the ray, it is considered that the object is closing. In this case, we gave the ray identifier f lag i associated with the intercepted object, as shown in the following equation: Calculate the distance to the truncation point, denoted as ray i ; the vector is shown in Equation: G = {( f lag 1 , ray 1 ), ( f lag 2 , ray 2 ), . . . , ( f lag l , ray l )} The l-beam returns detection feedback based on the environmental information sensed by the vessel's current position. To reduce the computational effort of the algorithm, we find when l = 24 can obtain the best efficiency.
The data sequence state-space consists of five different sensor states: • Position sensor: obtain the current vessel's coordinates (x i , y i ); • Speed sensor: obtain the current vessel speed vector v x , v y ; • Heading sensor: obtain the current ship's bow direction (true heading TC) θ N ; • Radar: obtain the perception information about the vessel G = [g 1 , g 2 , . . . g l ]; • Distance sensor: obtain the straight-line distance D to the endpoint. Having completed the state-space design, we process the data and send it to the network for inference. During processing, the data are normalized, stacked, and formed into consecutive frames by superposition. The normalization, which, according to the maximal value of different sensors, is present to reduce the scale difference between samples and make the gradient update more stable, thus helping faster convergence to the optimal policy.
Before executing the continuous frame data overlay, we convert the data dimension to 1, f rame overlay , state len , concatenate along the direction of dim = 1, and flatten the concatenated data at the last dimension, i.e., 1, f rame overlay * state len . By overlaying multiple data frames, the agent can understand the current motion direction and speed of the ship and, thus, make reasonable decisions. In contrast, it is not available in single-frame data.

Image-Based State-Space
The image-based state-space contains the current agent scenario image and navigation data. The agent uses the image to evaluate the relationship between the ship and the arrival point, boundary, and obstacles to realize the perception of the surrounding area. During the data processing, we divide the image and navigation information into two parts, the image part (which handles grayscale and normalization), and the navigation part (which normalizes). After that, we stack features to the same dimension at each part, and the network fuses the features of these two parts.
In image processing, the image shape which returned from Pyglet rendering RGB data is [3, sn, sn]. We convert the three-channel RGB image into a single-channel grayscale map, and the image size is appropriately reduced to [1, 1, x, x] (x < sn) for policymaking, while the grayscale image is normalized in the image processing.
The image-based state-space consists of the following vectors: • Current scenario seascape: image data in the shape of [1, 1, x, x] with normalization operation; • Position vector: spatial coordinate information (x i , y i ) where the current ship is located; • Velocity vector: obtain the current ship's velocity vector information v x , v y .
We keep the same data overlay operation with the data sequence state-space and concatenate the two groups of processed image and navigation feature data along the direction of dim = 1 (the image shape is [1, 1, x, x] and the navigation data shape is [1, 1, 4]). Finally, we obtain the overlayed feature data with the image shape 1, f rame overlay , x, x and the navigation data shape 1, f rame overlay * 4 .

Action Space Design
The Box2D physics engine is used in the environmental design for ship action control. The simulated trajectory of the ship's rigid body is calculated by the semi-implicit Euler method to generate a planning curve in path planning that matches its physical characteristics for simulated propulsion control. To ensure accurate planning paths, the action space is designed in two dimensions, i.e., angle and propulsion power: 1.
Angle dimension: In this path planning model, the angle design uses continuous space, ranging from 0 to 360 degrees, and is mapped normalized to the interval [0, 1]; differently from traditional discrete angle design (such as four-connected or eight-connected actions), the continuous angle control allows the algorithm policy to perform more accurate navigation control to achieve more precise planning; 2.
Propulsion power dimension: Propulsion power is the power generated by the ship's thrusters. In this model, we use the Box2D to complete the propulsion power design, which considers the effects of thrust derating and fluid resistance. The maximum value of the main propulsion power is set to 70. Consistent with the angle treatment, the propulsion power is also normalized and mapped to the interval [0, 1]. In summary, the action space is designed with continuous control in angle and propulsion power dimensions. The agent can flexibly adjust the thrust angle and power to competently adapt to the physical characteristics of the ship control in the path planning task and, thus, perform accurate planning in different scenarios.

Reward Function Design
The reward function should correlate highly with the state-space to make the agent learn the expected policy [16]. We divided the reward function into mainline and auxiliary reward, where the design of mainline reward R dist mainly considers the distance between the ship and the destination, and the creation of auxiliary focuses on the effect of arrival point, boundary control, and obstacles on the traveling state. We set the initial distance between the vessel and the target generation position at initializing stage as dist init . While the agent interacts with the environment, we record the distance between the ship and the endpoint as dist rec . With the hull position changing, if the current distance dist cur is less than dist rec , a positive reward is given and rewritten dist rec by dist cur ; otherwise, it is returned as negative. The main line reward is as follows: To prevent the agent from abnormal state-shifting in policy learning, which leads to failure to learn the excepted policy, we solved this problem by creating a reward function with a potential-based reward shaping method [17], i.e., the repulsive field supported by obstacles and boundary control, and the gravitational field supported by arrival points.
The agent is given a decreasing penalty when it is close to the endpoint and an increasing penalty when it is close to the obstacles or boundary. This reward format can help the agent learn the robust correlation between state and reward, thus enabling the agent to accelerate the agent policy convergence while maintaining the invariance of the optimal policy.
The reward function is designed to remap the rendered image of n * n onto a plane of size m * m(n > m) and calculate the potential field value at each point on that plane. The specific details are as follows: 1.
The repulsive field generation is calculated by the potential decline of the 2.5 times obsc R obstacles range. The distance from the ship to the obstacle's center is dist C . We design the reward function satisfy the factors of the potential; note that the bias calculated by obsc R is α. After the matrix is generated, the potential matrix generated for each obstacle is normalized and summed, as shown in Figure 3a and explained below:

2.
The arrival point-based gravitational field is generated by calculating the distance between different positions of the potential matrix and the arrival point, and calculating the gravitational potential decay based on the span. Setting the radius of the arrival point as reach R , the distance between the current position and the arrival point is dist C . The bias calculated by reach R is β. Figure 3b shows the arrival point-based potential plane, as calculated below: 3.
The Repulsive field generation, based on boundary control, makes the agent consider the unknown of the boundary situation in path planning when the agent output action makes the ship move towards the boundary, gradually increasing the penalty. Thus, this ensures the route selected by the agent favors the known and clear center region and improves the planned route's security. If we set that the penalty boundary is bound limit , the distance between the current position and the nearest edge is dist limit , and the bias calculated by dist limit is γ. Figure 3c shows the potential plane based on boundary control, calculated as follows: From the above analysis, after obtaining the potential matrices with obstacles, arrival points, and boundary control, we can superimpose them and normalize them to obtain the complete potential-based reward function, as shown in Figure 4.

Beta Policy-Based Distributed Sampling PPO
When applying the policy gradient method for continuous domain control path planning, the DDPG [18] algorithm is usually used to solve complex tasks with highdimensional data in the continuous domain. However, since DDPG is a deterministic policy, exploration noise needs to be added at the early stage of training to help the agent explore the environment. Suppose the agent cannot obtain enough mainline rewards during exploration. Under the influence of sparse rewards and deterministic policy, DDPG quickly converges to a lousy policy [19], which makes the actor enter a deadlock state, which cannot be jumped out of by increasing mainline rewards later.
DDPG suffers from long training time, low exploration efficiency, and unstable learning strategies in path planning tasks, leading to poor reproducibility and unsuitable path planning in complex environments.

PPO Algorithm Applied to Global Path Planning Task and Optimization
As an on-policy algorithm, PPO uses the surrogate objective to optimize the policy, making it available to figure out discrete and continuous action space. PPO performs action sampling through the latest version of the stochastic policy, which helps the algorithm complete the environment exploration at the early training stage. With training progress, the policy gradually changes from high entropy to low entropy. Hence, the stochastic policy also benefits when facing complex stochastic environments.

PPO Algorithm Operation
The PPO algorithm is a variant of the policy gradient method, which typically calculates a policy gradient estimator and uses the stochastic gradient ascent to optimize. PPO uses the clipping of the surrogate function to reduce the problem of excessive deviation of the new policy from the old one: Above, π θ and π θ old denote the new and old policies, and the hyperparameter is used to limit the difference between the two policies. Suppose, in state s, A π (s, a) > 0; this indicates that the action-value brought by the selected act is greater than the average in the current state. In that case, the update towards the gradient ascent direction increases the selection of actions in such states. A π (s, a) < 0 indicates the action-value brought by the selected act is lower than the average in the current state, and the update occurs in the gradient descent direction to decrease the selection of actions in such states. With the Actor and Critic training, the Critic evaluates more accurately, and the policy network's distribution entropy decreases with the training to the convergence state.

Actor-Critic Network Structure
The Actor network uses the same feature extraction structure as the Critic network. However, the feature fusion differs due to the different output dimensions; the Actor and Critic networks vary slightly depending on the state-space.
The network for processing the data sequence state-space is a four-layer fully connected structure, and the hidden layer is [400, 300, 300]; the network for processing the image state-space adds feature extraction layers to extract information from image state, and the structure of the Actor in the image state-space is shown in Figure 5. The image feature extraction structure consists of four convolutional layers. The first three layers use the same convolutional architecture of DQN when processing Atari images [9]. The fourth layer adds a convolutional layer with an output channel of 256, a kernel of 3 * 3, strides of 1, and a padding of 1. The extracted feature is concatenated with the current ship's navigation information and then fed into the fully connected network with the hidden layer of [1124, 500, 300]. Finally, the mean and variance of Gaussian distribution are output.
Since the current model is used to solve the global path planning task, the model can be made to learn the spatio-temporal relationship of object motion on consecutive time frames through data frame superposition. To match the continuous data frame superposition structure used in the two state-space designs, we consider multi-frame superposition in the design of both model network structures.

Problems of PPO Algorithm in Global Path Planning
When training with PPO for images, the different areas may lead to the environment not being fully explored, i.e., the generation of random positions can cause the agent to enter the termination state early due to collision with the object prematurely. This results in an extreme lack of samples, making it too difficult to train at some starting positions owing to the insufficient positive feedback, resulting in the agent learning only those policies for which data are easily obtained in the training process. Figure 6a shows the test results of six positions selected by equal spacing. From Figure 6a (subgraphs c and d), the agent can learn a better policy at the corresponding starting position in graph a due to the uneven sample acquisition. In contrast, the policies adopted by the agent at the related starting points in the b, c, e, d, and f graphs are similar to those acquired in graph a, indicating that the uneven sample acquisition during the agent's interaction with the environment affects agent policy learning.

Distributed Sample Extraction PPO
According to the problems encountered in Section 4.2, a distributed sampling strategy based on region partitioning was designed, w hich separates execution from training; each executor explores from different regions and interacts with the environment [20].
The algorithm consists of N sub-processes interacting with the environment and collecting samples into their respective buffers. The process initializes the ship's position from its sub-regions ( Figure 7) and interacts with the environment. After collecting T iter in the buffer, the worker calculates the advantage function and discount reward from the collected data and sends it to the main process in a package. When the main process completes training, the workers pull back the updated parameters from the global and use the new policy to interact. After the global network collects the data sent back from workers, N * T samples are shuffled and cut into mini-batches for training. Compared with the A3C algorithm [21], which calculates the gradient by sub-process and uploads the grad to the global, the global network accumulates the grad and optimizes the policy. The newer distributed sample extraction PPO uses sub-processes to collect data and sends them to the main process for optimization. In this sampling method, data from different starting regions can be further integrated, sharing data among workers and increasing sample diversity. The experimental results in Section 5.1 show that this improved method significantly enhanced the robustness of the agent policy, which indicates that the distributed sampling policy based on region division is sufficient to make progress on complex exploration problems. Figure 8 shows the distributed sampling policy based on region division. On the one hand, the number of samples at each initial position is balanced by region interaction. On the other hand, the distributed sampling policy obtains several times the data of the PPO algorithm over the same duration, thus reducing the convergence time. The robustness of the algorithm is significantly improved due to the rising number of samples. Figure 6b shows the test results. Compared with the PPO baseline (Figure 6a), the distributed sampling policy with area division can achieve better performance in the test of starting positions at equal intervals.

Improvement of Action Policy Based on Beta Distribution in Continuous Action Space
In stochastic policy continuous action control problems, the action sample is typically obtained from a Gaussian distribution N µ, σ 2 by predicting the mean and variance in the neural network and optimizing the parameters by computing the gradient of the policy relative to µ and σ. However, the Gaussian distribution is an infinite support distribution. The mean and variance predicted by the neural network are limited by the action space when used in the Gaussian distribution, causing any action sampled from the distribution to be forcibly mapped back into the action space if it exceeds the action boundary. As shown in Figure 9, the Actor calculated the rudder angle according to the specified state and received the distribution parameter of µ = 0.7 and σ = 0.8, due to the rudder angle being limited to (−1, 1). Still, the probability density π θ (a | s) corresponding to the action a at any position in the Gaussian distribution is greater than 0. To analyze this process, assume that the action space of rudder angle A = [−1, 1], with no clipping in the distribution sampling; then, clipping operates only in the environment. The gradient calculation uses the non-clipping action when computing the surrogate objective in PPO; the policy gradient estimator becomes g q = ∇ θ log π θ (a | s)A π (s, a ), where a denotes the clipped action, and A π (s, a ) represents the advantage function generated in the clipped action. By calculating the gap between the gradient E[g ] − ∇J θ (π θ ), the expected policy gradient is as shown: a − Q(s, a)]da ds (12) It could be seen from the formula that the calculation is unbiased within the action space A. This boundary effect caused by the Gaussian distribution produces a deviation in action sampling, which is passed to the calculation of the policy gradient, affecting the training speed.
B(α, β) = Γ(α)Γ(β) Γ(α+β) represents the Beta function, which ensures that f (x; α, β) integrates to 1 over the interval [0, 1]. The Beta distribution takes values in the range x ∈ [0, 1], the shape of the distribution is determined by its shape parameters α and β, and, as a finite-support distribution, it does not produce a Gaussian-like boundary effects. Therefore, it is unbiased in the gradient computation and converges faster than other infinite support distributions.
In this paper, to ensure that the action outputs conform to the different limits in the action space, it is necessary to remap the range of the Beta distribution. Therefore, the Beta policy is defined as π beta θ = f x+c 2 * c ; α, β . The value of parameter c is determined based on the action range in different dimensions of the action space, thereby avoiding the issue of sampling actions beyond the action boundaries.
The Beta distribution is generated by using policy θ to estimate the parameters α and β. Ensuring α, β > 1 makes the distribution a concave probability density curve in the interval with a single-peaked shape. In the experiments, when changing the sampling distribution to Beta with the same parameter settings, the training speed was faster than Gaussian, and the accumulated reward was higher than the Gaussian policy for the same training time.

Experiments
The environment building parameters are shown in Table 2, the training hyperparameters are set as shown in Table 3, and the training scenarios are shown in Table 4. We set the training and testing stage frame size as n = 16, rendering mapping size s = 30, and rendering screen size as 480 * 480. Testing found that rendering the potential field in the range of [80, 160] improved the algorithm performance when the mapping size increased. Still, when m was set above 160, the improvement of the algorithm effect was obviously weakened; thus, we set the potential field remapping parameter to m = 160.

PPO Baseline Compared with Distributed Sample Extraction PPO
To compare the performance of distributed sample extraction PPO with PPO baseline in the path planning task, we compared the accumulated rewards of the training phase, as shown in Figure 10. Analyzing the training results, the accumulated reward of the PPO baseline converged to a value of −210.5 at 230 episodes, but, as the training proceeded, the reward showed a dropping trend, which means that an attenuation occurred in policy, and this status could not improve, even with increased training. The distributed sample extraction PPO algorithm uses multiple workers to collect samples from different regions. On the one hand, this sampling method increases the number of samples; on the other hand, it ensures that the dataset samples are collected in each region in a balanced manner, hence increasing the stability of training. In the improved sampling policy, the accumulated reward from the several workers was significantly higher than that obtained by the random position generation based on the initial region in the PPO baseline.

Comparison of Reinforcement Learning and Traditional Path Planning Algorithms for Planning Paths
To compare the difference between RL and traditional path planning methods in the planned paths, we selected two scenarios. We set the same initial status in each system and used the A* algorithm and RL method to compare. The scenario settings and the results of the planned paths of the two methods are shown in Figure 11.   Figure 11a,b, the trajectory that RL searched was smoother than A*, and the planned paths did not have trouble with being too close to obstacles. Due to a large number of reefs near small islands, deceleration and detours are usually performed when passing through the reef area to ensure navigation safety. The paths of A* did not take into account the problem of the safety of navigation. In contrast, routes with economies and security could be planned with RL.

Beta Policy and Gaussian Policy
To compare the performance of the Beta and Gaussian policies in the path planning task, we trained the Beta and Gaussian policies in two different dimension state-spaces and compared the accumulated reward of each worker, as shown in Figure 12. Meanwhile, the path planning test was driven for both state-spaces, comparing the planned path length distribution and the routed success rate in different policies, as shown in Figure 13. • Comparison of accumulated reward during training: The Beta and Gaussian policies were trained for five rounds. Figure 12 shows the reward with the corresponding mean curves and error intervals. In the worker0 and worker3 cases, the Beta policy had a higher reward than the Gaussian policy during the training, in terms of both data sequence and image. Meanwhile, the Gaussian policy in the worker1 and worker2 cases oscillated in the late training period, i.e., the issue of policy decline occurred. This occurred because the policy certainty of the Gaussian gradually increases during the training, with the gradual decrease in σ leading to the high variance in the Gaussian, which creates a heavy oscillation when it is close to 600 episodes. In worker3, the Gaussian policy based on the data sequence and image had an obvious decay after 100 iterations. Conversely, the Beta policy had a more obvious recovery after 400 episodes, and the reward of the Beta was always greater than the Gaussian at the same iteration. Since the area corresponding to worker4 was closer to the endpoint, Gaussian and Beta demonstrated less difference in accumulative rewards; Comparing the reward convergence of these two state-spaces in training, it was found that the number of iterations required for the image to converge in training was significantly less than that of the data sequence. This is mainly because the image provides more comprehensive global information, enabling the agent to have more feature information to be considered in execution, thus planning a more comprehensive path because the global information is knowable. However, because the state-space based on the data sequence cannot obtain complete information in the environment and can only detect through radar data, the data sequence takes a long time to acquire samples from interacting, so the image converges faster than the data sequence. In summary, the Beta policy learned by each worker significantly improved compared to Gaussian, and, in terms of stability and robustness, Beta was also higher than the Gaussian policy; • Comparison of planning path length: Next, we took the Beta and Gaussian policy models with the same number of training iterations for the tests of data sequence and image. In testing, we used the same test position in Section 4, and each part was used for 200 iterations of path planning. We drew a box diagram of the planned length of each starting position according to the results, and compared the distance between the quartiles and the range of the extreme points by box plots to determine the stability of the planned path length. Figure 13 shows the test results. The Beta policy was significantly better than the Gaussian in the path planning stability in different state-spaces, and this advantage was shown in the case of testing at the same initial position, in which the median of Area1 was nearly 50% higher in the Beta policy (data sequence 171, image 180) than the Gaussian policy (data sequence 329, image 370). The differences between Q3 and Q1 of the Beta policy in Area2, Area4, and Area5 were lower than those of the Gaussian (more obviously in the data sequence case), and the distribution of planning path length was relatively concentrated. As well, the median of the data was significantly lower than that of the Gaussian strategy, indicating that the Beta strategy was more stable during the test; • Planning shortest path length: According to the shortest path length analysis based on the above data, as shown in Table 5 and Figure 13, the path length of Gaussian policy search in regions 1, 4, and 5 was 4.1%, 6.5%, and 43.9% higher than that of Beta, and the path length of Gaussian policy search in regions 2, 3, and 6 was not much different from that of Beta; • Planning success rate: As shown in Table 6, the planning success rates of Gaussian in regions 3, 4, and 5 were not as expected, but, in Beta, the success rate performance was higher in all areas, and the planning success rate was higher than 75% in each region.  In this model, reinforcement learning (RL) used continuous action space exploration for global path planning. It leveraged Box2D for continuous angle manipulation and simulated ship engine propulsion effects in the action space framework. This approach resulted in smoother trajectories compared to traditional algorithms.
To verify whether there were other aspects of performance loss with smoother tracks in RL, we created experiments to compare the performance of distributed extractor PPO based on a Beta policy with the traditional algorithms. We refactored the code of the heatmap obstacles to generate the rasterized map. We discretized a map to a size of 160 * 160 and marked barriers, in particular, to generate the map, which corresponded to simulations. The traditional algorithms searched the track in a rastered map and mapped the nodes searched to the dimensions of the simulation to determine the distance. The comparison between reinforcement learning path planning and traditional path planning algorithms was mainly assessed in three aspects: planning time, the number of planning loop operations, and path length.
The time required for the algorithm to perform path planning in different areas (Table 7) and the number of planning loop operations (Table 8), as well as the planned path length (Table 9), were considered separately during the test. We compared the traditional algorithms IDA*, Dijkstra, A*, and bidirectional A* with the reinforcement learning algorithm TD3 [23], as well as the optimized Gaussian policy PPO algorithm (Normal-PPO) and the Beta policy optimized PPO algorithm (Beta-PPO).
From Tables 7-9, Dijkstra took the longest time of the traditional algorithms and required the most iterations, but obtained the best solution for regions 3 and 4. IDA took significantly longer to track than other algorithms in regions 1, 3, and 5. However, its search paths were the shortest in each region, compared with different algorithms of the same type. The Bi-directional A* took significantly shorter planning time than other algorithms in each area and had the lowest number of calls among traditional algorithms; however, Bi-directional A* did not guarantee that the searched path was the best solution.  During the comparison of reinforcement learning algorithms, the training occurred under the same hyperparameters, including interaction environment parameters, batch size, and number of iterations. The best policy model from the training process was then selected for testing. Upon analyzing the data in Tables 7-9, we observe that the TD3 algorithm performs markedly worse than other algorithms for planned path length in regions 1-6. Additionally, it requires the highest number of algorithm calls among the reinforcement learning algorithms. Based on this analysis, we can conclude that the TD3 algorithm is notably less effective than the optimized PPO algorithm in generating random initial regions.
In the comparison between the optimized Gaussian policy PPO algorithm and the Beta policy optimized PPO algorithm, the Gaussian policy requires fewer algorithm calls than the Beta policy, resulting in better planning time in testing. However, the experimental results show that the Beta policy has a better planning success rate and path length than the Gaussian policy. This is because the Beta policy employs more algorithm calls in the planning process, enabling finer maneuvers during path search that lead to shorter paths in global path planning. When comparing the differences between RL and traditional path planning algorithms in terms of planning time, number of algorithm calls, and path length, as shown in Tables 7-9, we see that RL considers the propulsion characteristics of the ship for continuous control in path planning. This approach yields smoother trajectories than traditional algorithms, and RL's planning task performance and path length are comparable to that of conventional algorithms.

Conclusions
This study aims to address the limitations of both traditional planning algorithms and reinforcement learning algorithms in the context of path planning tasks. Traditional planning algorithms require precise environment modeling, while most reinforcement learning algorithms can only perform path planning in discrete action spaces. To this end, this paper designs a generic environment framework for path planning tasks, builds the environment structure using the Box2D physics engine, and implements the environment design for continuous state-space and action space for reinforcement learning. On this basis, a deep reinforcement learning path planning, without precise modeling to enable dynamic decision-making, is realized by designing two state-spaces based on sensor data and image data.
In addition, this study solves the problems that reinforcement learning faces in path planning tasks, such as the unreasonable design of the reward function, using deterministic policies, and incomplete consideration of ship propulsion characteristics, as well as that traditional planning algorithms increase computation when facing scenario enlargement and other situations and cannot handle high-dimensional data. In this study, a reward function is designed to match the state-space, which considers the distance between the agent and the arrival point as the main part and incorporates the potential-based superposition of the arrival point, obstacles, and boundary control as the auxiliary parts. Moreover, the stochastic policy algorithm PPO is employed as a baseline for path planning. This enables the reinforcement learning algorithm to learn from the continuous angle control and engine propulsion features simulated by Box2D, effectively incorporating the ship's propulsion characteristics into the path planning process.
Building upon this, we propose a distributed sampling PPO algorithm based on the Beta policy. This algorithm addresses the issue of sample collection imbalance in global path planning tasks and effectively improves the performance and robustness of the algorithm. By resolving the action boundary problem associated with Gaussian policy, the proposed approach enhances the algorithm's performance in terms of accumulated reward and planning success rates across different sub-regions.
However, this study still has certain limitations. Specifically, the absence of testing in real marine environments during the experiments may lead to potential performance degradation in practical applications. To address this limitation, we are currently engaged in the deployment of the algorithm on hardware to assess the performance of the deep reinforcement learning global path planning algorithm in the context of sim2real. Furthermore, this study did not include a comparison with genetic algorithms for path planning. As another essential algorithm in path planning, genetic algorithms deserve investigation regarding their performance within the environment framework designed in this study. Thus, we will introduce genetic algorithms into the path planning environment designed in this study to evaluate their performance.

Conflicts of Interest:
The authors declare no conflict of interest.