Article

Global Path Planning for Land–Air Amphibious Biomimetic Robot Based on Improved PPO

1 The College of Artificial Intelligence and Robotics, Hunan University, Changsha 410082, China
2 Greater Bay Area Institute for Innovation, Hunan University, Guangzhou 511300, China
3 The College of Architecture and Urban Planning, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
Biomimetics 2026, 11(1), 25; https://doi.org/10.3390/biomimetics11010025
Submission received: 27 November 2025 / Revised: 21 December 2025 / Accepted: 30 December 2025 / Published: 1 January 2026

Abstract

To address the path planning challenges for land–air amphibious biomimetic robots in unstructured environments, this study proposes a global path planning algorithm based on an Improved Proximal Policy Optimization (IPPO) framework. Unlike traditional single-domain navigation, amphibious robots face significant kinematic discontinuities when switching between terrestrial and aerial modes. To mitigate this, we integrate a Gated Recurrent Unit (GRU) module into the policy network, enabling the agent to capture temporal dependencies and make smoother decisions during mode transitions. Furthermore, to enhance exploration efficiency and stability, we replace the standard Gaussian noise with Ornstein–Uhlenbeck (OU) noise, which generates temporally correlated actions aligned with the robot’s physical inertia. Additionally, a Multi-Head Self-Attention mechanism is introduced to the value network, allowing the agent to dynamically prioritize critical environmental features—such as narrow obstacles—over irrelevant background noise. The simulation results demonstrate that the proposed IPPO algorithm significantly outperforms standard PPO baselines, achieving higher convergence speed, improved path smoothness, and greater success rates in complex amphibious scenarios.

1. Introduction

Land–air amphibious biomimetic robots have garnered significant attention in recent years due to their unique ability to navigate complex, unstructured environments. By combining the high-speed mobility of aerial vehicles with the long-endurance capability of terrestrial robots, these platforms show immense potential in disaster rescue [1], military reconnaissance [2], and environmental monitoring [3]. Mimicking the versatile locomotion of biological organisms like birds [4] or insects, these robots can seamlessly transition between crawling on uneven terrain and aerial maneuvering to overcome obstacles. However, the dual-mode nature of these robots introduces severe challenges in motion control and path planning [5]. Unlike single-domain agents, amphibious robots must deal with distinct kinematic dynamics in different media and, more critically, ensure stability during the impulsive transition between ground and air modes. To provide a comprehensive context for our proposed solution, we review existing literature from three perspectives: classical path planning methods, reinforcement learning for single-domain robots, and emerging research on multi-modal platforms.
In the domain of classical path planning, methods are generally categorized by the degree of environmental information available: global path planning, which utilizes complete known environmental data [6], and local path planning, which relies on real-time sensor data [7]. This paper focuses on planning an energy-efficient and path-optimal route within global maps, an area where traditional algorithms have established a strong foundation. For global search tasks, the A* algorithm remains a dominant baseline. Recent studies have optimized A* by improving heuristic functions and neighborhood search strategies (e.g., 24-neighborhood search) to significantly enhance search efficiency and path smoothness in static environments [8,9]. In high-dimensional complex environments, sampling-based methods have shown superiority; for instance, He et al. [10] developed an improved RRT*-Connect algorithm to successfully navigate high-DOF manipulators through multi-obstacle narrow passages. Regarding local planning and dynamic obstacle avoidance, Yan et al. [11] proposed improved Artificial Potential Field (APF) methods to overcome the inherent “local minima” problem in real-time scenarios. Notably, specific to the domain of amphibious robots, Li et al. [12] recently adapted the Dynamic Window Approach (DWA) to handle the kinematic variations between water and land environments. Despite these advancements, however, such traditional methods typically depend heavily on pre-built environmental models or deterministic kinematic constraints. They often struggle to adapt to the highly nonlinear dynamics and cross-domain mode switching (e.g., land-to-air) required by biomimetic robots, leaving them poorly equipped for such complex, unstructured scenarios.
To address the limitations of classical methods, Reinforcement Learning (RL)—particularly Deep Reinforcement Learning (DRL)—has been introduced to optimize behavior strategies through real-time trial-and-error interaction without relying on prior environmental knowledge [13]. In the ground domain, DRL has demonstrated robust performance in handling unstructured environments. For instance, Bai et al. [14] utilized DRL to enable autonomous navigation of mobile robots in comprehensive unknown environments, significantly improving exploration efficiency. To address dynamic uncertainties and temporal dependencies, Zhang et al. [15] proposed an SAC-LSTM algorithm that integrates Long Short-Term Memory networks, allowing the agent to utilize historical information for rapid decision-making. Furthermore, Nie et al. [16] developed an enhanced Proximal Policy Optimization (PPO) algorithm with sample regularization and adaptive learning rates, which effectively improved path planning stability for Automated Guided Vehicles (AGVs). In the aerial domain, research has focused on 3D navigation and energy optimization. Fei et al. [17] applied DRL to achieve autonomous navigation and collision avoidance for UAVs in unknown environments. Addressing the complexity of three-dimensional spaces, Qi et al. [18] proposed an improved PPO algorithm for 3D path planning, enhancing convergence speed in complex scenarios. Additionally, considering the limited battery life of aerial platforms, Chen et al. [19] introduced an energy-efficient path planning method based on DRL. While recent advancements have begun to explore cross-domain applications—such as Cao et al. [20] applying SAC to amphibious UAVs (water–air) and Mondal et al. [21] investigating cooperative routing and planning for heterogeneous UAV-UGV systems using an attention-aware DRL framework—these works primarily focus on either fluid–air transitions or multi-agent collaboration. There remains a notable gap in unified global path planning for single-agent land–air biomimetic robots, specifically regarding the challenge of kinematic discontinuity during ground–air mode switching.
Emerging research on multi-modal platforms presents a different landscape. While recent hardware innovations have demonstrated the feasibility of such systems—as evidenced by Zhou et al.’s [22] comprehensive review, Mandralis et al.’s [23] mid-air morphing “ATMO” robot, and Zhang et al.’s [24] multimodal soft amphibious robot—motion planning strategies have largely lagged behind. Much of the existing literature focuses on the collaborative control of heterogeneous systems (e.g., UAV-UGV formations) [25], rather than single-agent autonomy. Research on single-agent global path planning is far less developed due to severe cross-domain kinematic constraints during media transitions, a challenge highlighted by Liang et al. [26]. Consequently, applying standard Deep Reinforcement Learning (DRL) frameworks to these complex environments often proves inadequate, manifesting as three critical operational failures specifically affecting amphibious robots. First, regarding temporal blindness, standard algorithms like PPO treat state-action pairs independently. This prevents the agent from retaining momentum information during ‘Kinematic Discontinuities’ (e.g., taking off from a slope), leading to abrupt thrust loss and instability. Second, regarding sparse rewards, unlike 2D rovers, amphibious robots operate in a vast 3D volumetric space where goal-directed signals are rare. This often traps the agent in ‘aimless loitering,’ causing it to miss narrow spatial corridors necessary for flight traversal. Finally, concerning exploration efficiency, generic strategies such as Gaussian noise generate uncorrelated, jerky control signals. For biomimetic robots with complex wheel-leg linkages, these high-frequency oscillations not only hinder convergence but also inflict mechanical stress, risking physical damage to actuators. Therefore, introducing a memory-capable mechanism and enhanced global perception into the decision-making process is crucial for maintaining trajectory smoothness and stability in hybrid environments.
To fill this gap, this paper proposes a global path planning algorithm based on an Improved Proximal Policy Optimization (IPPO) framework. Specifically designed for land–air amphibious biomimetic robots, this method ensures smooth and stable navigation across different media. The main contributions of this paper are summarized as follows:
(1) Amphibious Framework Design: We establish a global path planning framework specifically for land–air amphibious biomimetic robots, incorporating a comprehensive reward function that balances energy efficiency, safety, and mode-switching stability.
(2) Temporal Feature Extraction: A Gated Recurrent Unit (GRU) is integrated into the policy network. This module captures the temporal dependencies of the robot’s motion, effectively eliminating trajectory oscillations during the critical take-off and landing phases.
(3) Safe Exploration Mechanism: We replace the standard Gaussian noise with Ornstein-Uhlenbeck (OU) noise. This strategy generates temporally correlated exploration actions that align with the physical inertia of biomimetic joints, preventing mechanical damage from high-frequency control jitter.
(4) Enhanced Global Perception: A Multi-Head Self-Attention mechanism is introduced to the value network. This allows the agent to dynamically prioritize critical environmental features (e.g., narrow obstacles) over irrelevant background noise, significantly improving the success rate in cluttered environments.

2. Land–Air Amphibious Biomimetic Robot Platform

2.1. Land–Air Amphibious Biomimetic Robot Hardware Platform

The land–air amphibious biomimetic robot operates in two modes: ground and flight. Its design must balance the endurance of the drone form against the terrain passability of the unmanned-vehicle form. For this reason, this study adopts a deformable combination of a multi-rotor drone and a tracked unmanned vehicle, as shown in Figure 1. The hardware includes an onboard computer, a Pixhawk 4 flight controller, a binocular vision camera, and LiDAR sensors.
Figure 2 shows the mode diagram of the land–air amphibious biomimetic robot. In ground mode, the robot uses a tracked drive with strong environmental adaptability; in flight mode, it operates as a quadrotor aircraft. The multi-rotor design allows the robot to take off and land almost anywhere and provides good maneuverability. When transforming from the unmanned-vehicle form to the drone form, the tracks on both sides fold upward via the foldable track arms to become a protective frame for the propellers; the folding takes 2 s. In this study, the flight speed of the robot is set to 2 m/s and the ground speed to 1 m/s. Because flight consumes far more energy than ground travel, the energy model assigns 46.4 J/m to ground motion and 270.3 J/m to flight.

2.2. Kinematic Consistency and Energy Analysis

To facilitate efficient reinforcement learning training, we adopt a simplified average energy model based on the Cost of Transport (CoT). While flight energy physically depends on complex aerodynamics (e.g., banking angle, acceleration) and ground energy on terrain interaction, modeling these micro-dynamics at the planning level introduces excessive computational overhead and sparse reward noise.
Therefore, we linearize the energy consumption as:
$E_{total} \approx C_{air} D_{air} + C_{ground} D_{ground}$  (1)
where $D_{air}$ and $D_{ground}$ represent the path lengths flown and driven, respectively. The coefficients $C_{ground} = 46.4$ J/m and $C_{air} = 270.3$ J/m are derived from the empirical average power consumption of the robot at its nominal cruise velocities.
The justification for this simplification is threefold. First, regarding the macro-decision focus, the primary objective of the IPPO agent is to optimize high-level mode-switching strategies, where the significant magnitude disparity (≈1:6) between ground and air costs serves as the critical decision driver rather than minute fluctuations caused by instantaneous friction or drag. Second, in terms of computational efficiency, adopting a linear energy model significantly accelerates the reward calculation process throughout the millions of training steps required, thereby ensuring the overall feasibility of the algorithm. Finally, this approach enhances robustness; by employing conservative average values, we prevent the agent from exploiting simulation loopholes—such as unrealistic ‘gliding’ without energy cost—and guarantee that the planned paths remain valid and transferable under real-world disturbances.
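To make the model concrete, the following minimal sketch computes Equation (1) for a waypoint path, classifying each segment as ground or flight by the 0.2 m altitude threshold introduced in Section 5.3; the function and its interface are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

C_GROUND = 46.4         # J/m, empirical average ground cost at 1 m/s
C_AIR = 270.3           # J/m, empirical average flight cost at 2 m/s
FLIGHT_THRESHOLD = 0.2  # m, altitude separating ground and flight modes

def path_energy(waypoints: np.ndarray) -> float:
    """Approximate total energy (J) consumed along an (N, 3) waypoint path."""
    diffs = np.diff(waypoints, axis=0)                    # per-segment displacement
    lengths = np.linalg.norm(diffs, axis=1)               # per-segment length (m)
    mid_z = 0.5 * (waypoints[:-1, 2] + waypoints[1:, 2])  # mean segment altitude
    cost = np.where(mid_z > FLIGHT_THRESHOLD, C_AIR, C_GROUND)
    return float(np.sum(cost * lengths))

# Example: a 10 m ground run followed by a 5 m climb to cross an obstacle.
path = np.array([[0, 0, 0], [10, 0, 0], [10, 0, 5]], dtype=float)
print(path_energy(path))  # 10 * 46.4 + 5 * 270.3 = 1815.5 J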

3. The PPO Algorithm

Given the high-dimensional continuous action space defined by the dual-mode operation (flight and ground driving) described in Section 2, discrete control algorithms are insufficient. Therefore, we adopt the Proximal Policy Optimization (PPO) algorithm as our foundational framework. PPO is selected for its stability in continuous control tasks. However, as noted in our contributions, the standard “Vanilla PPO” lacks the temporal memory and noise filtration capabilities required for amphibious navigation. The mathematical formulation of the standard PPO is presented below, followed by our specific improvements.
The PPO algorithm is an Actor-Critic based method, with its structure illustrated in Figure 3. The core architecture of this algorithm consists of two key components: the policy network (Actor) and the value network (Critic). Specifically, the policy network generates state-to-action mappings by outputting probability distributions over the action space, while the value network focuses on estimating the state-value function to compute advantage functions, thereby guiding policy optimization and enhancing training stability.
The PPO algorithm enhances the policy network’s update process through the introduction of importance sampling and advantage functions, thereby improving sample utilization efficiency. Importance sampling is a method that estimates the expected value of one distribution by sampling data from a known distribution and weighting the samples accordingly. Given a random variable x following probability distribution p(x), the expected value of function f(x) is calculated as follows:
$E_{x \sim p(x)}[f(x)] = \int f(x)\, p(x)\, dx$  (2)
If sampling directly from p(x) becomes difficult, we may instead sample from an alternative distribution q(x). In this case, the expected value calculation formula becomes Equation (3):
$E_{x \sim p(x)}[f(x)] = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = E_{x \sim q(x)}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right]$  (3)
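As a quick numerical sanity check of Equation (3), the toy snippet below estimates an expectation under a target distribution p using samples drawn from a different proposal q; the particular distributions and test function are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Estimate E_{x~p}[x^2] for p = N(1, 1) using samples from q = N(0, 2),
# reweighting each sample by p(x)/q(x) as in Equation (3).
f = lambda x: x ** 2
x_q = rng.normal(0.0, 2.0, size=200_000)                  # samples from q(x)
weights = normal_pdf(x_q, 1.0, 1.0) / normal_pdf(x_q, 0.0, 2.0)
estimate = np.mean(f(x_q) * weights)                      # importance-weighted mean
print(estimate)  # ≈ 2, since E_{x~p}[x^2] = mu^2 + sigma^2 = 1 + 1
```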
In PPO, the goal of the policy network is to maximize the expected return of the new policy. However, since the new policy has not yet interacted with the environment and cannot directly sample data, it must rely on samples generated by the old policy; PPO therefore uses importance sampling to update the new policy from data collected under the old one. Direct use of importance sampling, however, may cause the policy update to be too large, destabilizing training. To this end, PPO designs an objective function with a clipping term that constrains the policy update by limiting the range of variation in the ratio between the new and old policies. This avoids the costly optimization of a KL divergence constraint, significantly improving computational efficiency and ease of implementation while preserving performance. The main calculation formulas are shown in Equations (4) and (5):
$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$  (4)
$L_t^{CLIP}(\theta) = \hat{E}_t\!\left[ \min\!\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\hat{A}_t \right) \right]$  (5)
where $r_t(\theta)$ is the probability ratio between the new and old policies for selecting action $a_t$ in state $s_t$, $\hat{A}_t$ is the advantage function, usually computed with Generalized Advantage Estimation (GAE), $\theta$ denotes the parameters of the policy network, and $\varepsilon$ is the clipping-range hyperparameter that limits the range of $r_t(\theta)$, typically set to 0.1 or 0.2. When the advantage function is greater than 0, the current action is better than average and its selection probability should be increased, while the upper limit on the policy ratio prevents excessive updates. Conversely, when the advantage function is less than 0, the current action is poor and its selection probability should be reduced, with the lower limit on the ratio again preventing excessive updates. In this way, PPO restricts the probability ratio $r_t(\theta)$ to the range $[1-\varepsilon,\, 1+\varepsilon]$, constraining the update amplitude of the policy network, preventing drastic fluctuations in policy performance, and ensuring training stability. The constraints of the objective function $L_t^{CLIP}$ are shown in Figure 4.
In PPO, the value network is updated by minimizing the mean squared error between the estimated state value $V_\theta(s_t)$ and the target value $V_t^{target}$, as shown in Equation (6):
$L_t^{Value}(\theta) = E_t\!\left[ \left( V_\theta(s_t) - V_t^{target} \right)^2 \right]$  (6)
where $\theta$ here denotes the parameters of the value network.
To encourage exploration and prevent the policy from converging to local optima, PPO introduces an entropy bonus while employing a neural network architecture with shared parameters between the policy and value networks to improve training efficiency. The final loss function is given by Equation (7):
$L_t(\theta) = \hat{E}_t\!\left[ L_t^{CLIP}(\theta) - c_1 L_t^{Value}(\theta) + c_2 S[\pi_\theta](s_t) \right]$  (7)
where $c_1$ and $c_2$ are hyperparameters controlling the weights of the value loss term and the entropy bonus term, respectively, and $S[\pi_\theta](s_t)$ denotes the entropy of policy $\pi_\theta$ in state $s_t$.
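For concreteness, the following sketch assembles Equations (4)-(7) into a single loss, assuming the caller provides log-probabilities from the current and old policies, GAE advantages, value predictions, value targets, and entropies as 1-D tensors; it is a minimal illustration, not the paper's implementation:

```python
import torch

def ppo_loss(log_probs, old_log_probs, advantages, values, value_targets,
             entropy, eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(log_probs - old_log_probs)               # r_t(theta), Eq. (4)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_loss = -torch.min(ratio * advantages,
                             clipped * advantages).mean()      # -L_t^CLIP, Eq. (5)
    value_loss = (values - value_targets).pow(2).mean()        # L_t^Value, Eq. (6)
    # Negative of Eq. (7), suitable for minimization by gradient descent.
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```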

4. Improved PPO Algorithm for Land–Air Environments

Although standard PPO provides a baseline for policy learning, its reliance on independent state processing and uncorrelated exploration noise proves inadequate for land–air amphibious robots. Specifically, this limitation exacerbates the ‘kinematic discontinuity’ problem highlighted in the Introduction, resulting in a loss of momentum information during mode transitions and generating mechanically harmful high-frequency control signals. To overcome these challenges, we present a novel IPPO algorithm that offers a holistic solution beyond generic hybrid methods. Our framework synergistically integrates three key enhancements: (1) a Gated Recurrent Unit (GRU) module to capture temporal dependencies for smooth trajectory generation; (2) Ornstein-Uhlenbeck (OU) noise to ensure temporally correlated, mechanically safe exploration; and (3) a Self-Attention mechanism to enhance global perception. These components work together to ensure seamless mode switching and robust path planning. The overall schematic of the proposed IPPO algorithm, detailing the interaction between these modules, is depicted in Figure 5 below.

4.1. Markov Decision Process of Land–Air Robot

4.1.1. State Space

To comprehensively describe the real-time state of the land–air amphibious biomimetic robot, this subsection designs its state space, which includes the robot's position, velocity, target information, and environment-sensing data. The real-time state is first described by the position coordinates $[x, y, z]$ and velocity $[v_x, v_y, v_z]$, where the position coordinates give the robot's current location in 3D space and the velocity reflects its motion along each coordinate axis. To guide the robot toward the target point, the relative position of the target with respect to the robot, $[\Delta x, \Delta y, \Delta z]$, is introduced and calculated as:
$\Delta x = x_p - x_c, \quad \Delta y = y_p - y_c, \quad \Delta z = z_p - z_c$  (8)
where $(x_p, y_p, z_p)$ are the coordinates of the target point and $(x_c, y_c, z_c)$ are the coordinates of the current position. In addition, the Euclidean distance $D$ from the robot to the target point is introduced, calculated as:
$D = \sqrt{\Delta x^2 + \Delta y^2 + \Delta z^2}$  (9)
This distance provides the robot with global positional information about the target point, facilitating path planning and navigation in complex environments. To enhance the robot’s environmental perception capabilities and enable autonomous obstacle avoidance and safe navigation, this study employs a 16-line 3D LiDAR as the primary sensor. Through feature extraction, the system obtains the nearest obstacle distances in 11 directions from each LiDAR data frame: front, left, right, front-left, front-right, front-up, front-down, front-upper-left, front-upper-right, front-lower-left, and front-lower-right, denoted, respectively, as:
$[s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9, s_{10}, s_{11}]$  (10)
In summary, the state space O t of the land–air amphibious biomimetic robot at time t can be expressed as:
$O_t = \{x, y, z, v_x, v_y, v_z, \Delta x, \Delta y, \Delta z, D, s_1, \ldots, s_{11}\}$  (11)
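A minimal sketch of assembling this 21-dimensional observation from raw quantities is shown below; the variable names are illustrative assumptions:

```python
import numpy as np

def build_observation(position, velocity, target, lidar_distances):
    """Assemble the state O_t of Eq. (11) from raw sensor quantities."""
    position = np.asarray(position, dtype=np.float32)        # [x, y, z]
    velocity = np.asarray(velocity, dtype=np.float32)        # [vx, vy, vz]
    delta = np.asarray(target, dtype=np.float32) - position  # [dx, dy, dz], Eq. (8)
    distance = np.linalg.norm(delta)                         # D, Eq. (9)
    lidar = np.asarray(lidar_distances, dtype=np.float32)    # [s1, ..., s11], Eq. (10)
    assert lidar.shape == (11,)
    return np.concatenate([position, velocity, delta,
                           [distance], lidar]).astype(np.float32)  # 21 dims
```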

4.1.2. Action Space

In this paper, the action space of the land–air amphibious biomimetic robot is designed to be continuous: it is described by the velocities $v_x$, $v_y$, and $v_z$ parallel to the x-, y-, and z-axes of the global coordinate system. The action space is therefore expressed as:
$A = \{v_x, v_y, v_z\}$  (12)
where the values of $v_x$, $v_y$, and $v_z$ lie in the range $[-1, 1]$.
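Because the environments are later wrapped as standard Gym interfaces (Section 5.1), this action space maps directly onto a continuous Box space. A minimal sketch, assuming the gymnasium package (the original Gym API is analogous):

```python
import numpy as np
from gymnasium import spaces

# Continuous 3-D velocity command of Eq. (12), bounded in [-1, 1] per axis.
action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
sample_action = action_space.sample()  # random [vx, vy, vz] within bounds
```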

4.1.3. Reward Function

In reinforcement learning, the agent explores the environment by performing different actions, obtains corresponding reward values, and updates its strategy based on the reward, thereby gradually optimizing its behavior. The reward function is therefore the core element of the path planning task, and the rationality of its design significantly affects the quality of the planned paths. However, traditional reward designs usually give a positive reward only when the agent reaches the target point, a negative reward on collision or boundary violation, and zero reward otherwise. Such designs suffer from reward sparsity, which delays feedback to the agent's decisions, impairs strategy learning, and may even prevent the algorithm from converging. This subsection therefore designs the reward function for path planning of land–air amphibious biomimetic robots, with each component described below.
1. Distance reward. The distance reward encourages the robot to approach its target by measuring the change in Euclidean distance between the robot and the target point from the previous timestep to the current one, thereby rewarding progress during path planning. Its mathematical expression is:
$r_{distance} = \eta (D_{t-1} - D_t)$  (13)
where $D_t$ and $D_{t-1}$ are the Euclidean distances between the robot and the target point at the current and previous timesteps, respectively, and $\eta$ is the distance reward coefficient.
2. Altitude penalty. The land–air amphibious biomimetic robot has two modes, flight and ground travel, with the energy consumption of flight being significantly higher than that of ground travel. To reduce the robot's energy consumption during flight and encourage ground travel, this paper introduces an altitude penalty. Its mathematical expression is:
$r_{height} = -\lambda \max(0,\, z_t - \delta)$  (14)
where $z_t$ is the robot's current altitude, $\lambda$ is the altitude penalty coefficient, and $\delta$ is the altitude threshold. When the robot's altitude is below the threshold $\delta$, no additional penalty is applied; if the altitude exceeds $\delta$, a penalty proportional to the excess is imposed.
3. Collision penalty. To ensure the safety of path planning in complex environments, this paper introduces a collision penalty to prevent collisions between the robot and obstacles. Its mathematical expression is:
$r_{collision} = -\mu L_{collision}$  (15)
where $L_{collision}$ is the collision flag ($L_{collision} = 1$ when a collision occurs, otherwise $L_{collision} = 0$) and $\mu$ is the collision penalty coefficient. This penalty mechanism effectively guides the robot to proactively avoid obstacle zones during path planning, enhancing both the safety and feasibility of the planned trajectory.
4. Time penalty. To encourage the robot to complete path planning efficiently, this paper introduces a time penalty designed to prompt the robot to reach the target point quickly while minimizing unnecessary movements, thereby reducing mission execution time. Its mathematical expression is:
$r_{time} = -k$  (16)
where $k$ is the time penalty coefficient. This penalty applies a small negative reward at each time step, guiding the robot to optimize path planning, improve execution efficiency, and reduce unnecessary movements.
5. Smooth reward. In continuous control tasks, excessively abrupt action variations may cause system instability, increased energy consumption, and trajectory oscillations. To ensure smooth robotic motion, this paper introduces a smooth reward. Its mathematical expression is:
$r_{smooth} = -\xi \left\| a_t - a_{t-1} \right\|^2$  (17)
where $\left\| a_t - a_{t-1} \right\|$ is the Euclidean norm of the change between consecutive actions and $\xi$ is the smooth reward coefficient. This term penalizes abrupt action changes, guiding the robot to generate smooth trajectories; it ensures motion stability, reduces unnecessary energy expenditure, and enhances control precision.
6. Obstacle traversal reward. To address scenarios where the robot encounters obstacles that are impassable in ground mode within complex environments, this paper introduces an obstacle traversal reward to incentivize the robot to employ flight mode when necessary for obstacle clearance. Its mathematical expression is:
$r_{over} = \beta L_{obstacle}\, e^{-\frac{(z_t - h_{obs})^2}{\sigma}}$  (18)
where $L_{obstacle}$ is the cross-map obstacle detection flag, $h_{obs}$ is the obstacle height, $z_t$ is the robot's current altitude, $\beta$ is the reward coefficient, and $\sigma$ is a tuning parameter. This reward reaches its maximum when the robot's altitude approaches the obstacle height, effectively compensating for the altitude penalty during flight.
7. Terminal reward. When the robot successfully reaches the target point, a significant positive reward is provided to incentivize task completion. Its mathematical expression is:
$r_{terminal} = R_{goal}$  (19)
where $R_{goal}$ is the terminal reward value, set to 800. This reward guides the robot toward optimal path planning by discouraging detours and ensuring timely arrival at the target, thereby enhancing planning efficiency.
In summary, the integrated reward function designed in this paper is:
$R_t = r_{distance} + r_{height} + r_{collision} + r_{time} + r_{smooth} + r_{over} + r_{terminal}$  (20)
However, simply aggregating these terms is insufficient; the efficacy of the algorithm hinges on the appropriate balancing of their respective coefficients. In terms of parameter configuration, rather than relying on fixed scalar values that limit generalization, we adopted a hierarchical tuning strategy to resolve potential reward interference and ensure robust convergence. Specifically, to prioritize safety, the collision penalty is set significantly higher than the maximum possible accumulated time costs, preventing the agent from intentionally crashing to terminate the episode early. Furthermore, to address the conflict between energy saving and obstacle traversal, the obstacle crossing reward is calibrated to outweigh the altitude penalty, ensuring the agent is motivated to switch to flight mode for overcoming barriers rather than being trapped in a ground-level local optimum. This relative scaling principle ensures that the global objective of safe and efficient navigation is maintained across varying environmental scales.
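The sketch below illustrates how the terms of Equation (20) combine into a per-step reward. The coefficient values are illustrative placeholders rather than the tuned values of Table 2; only their relative ordering (dominant collision penalty, traversal bonus outweighing the altitude penalty near obstacles) reflects the tuning strategy described above:

```python
import numpy as np

# Placeholder coefficients; only the relative scaling mirrors the text.
ETA, LAM, MU, K, XI = 1.0, 0.5, 400.0, 0.1, 0.2
BETA, SIGMA, DELTA, R_GOAL = 2.0, 1.0, 0.2, 800.0

def step_reward(d_prev, d_curr, z, collided, a, a_prev,
                near_obstacle, h_obs, reached):
    r = ETA * (d_prev - d_curr)                  # distance progress, Eq. (13)
    r -= LAM * max(0.0, z - DELTA)               # altitude penalty, Eq. (14)
    r -= MU * float(collided)                    # collision penalty, Eq. (15)
    r -= K                                       # time penalty, Eq. (16)
    r -= XI * float(np.sum((np.asarray(a) - np.asarray(a_prev)) ** 2))  # Eq. (17)
    if near_obstacle:                            # traversal bonus, Eq. (18)
        r += BETA * np.exp(-((z - h_obs) ** 2) / SIGMA)
    if reached:
        r += R_GOAL                              # terminal reward, Eq. (19)
    return r
```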

4.2. Improved Strategy Network with GRU

Land–air amphibious robots differ from standard UAVs or UGVs in that their current state is heavily influenced by previous momentum, especially during takeoff and landing phases. Feed-forward neural networks (as used in vanilla PPO) lack memory of these past states, often resulting in jerky control signals that can destabilize the robot. To address this, we embed a Gated Recurrent Unit (GRU) layer into the policy network. Unlike LSTM which involves complex gating mechanisms, the GRU offers a streamlined architecture that efficiently retains a “memory” of the robot’s kinematic history. This ensures smooth transitions between terrestrial and aerial modes and allows the agent to make decisions based on a trajectory of states rather than a single snapshot.
The Gated Recurrent Unit (GRU) [27], an improved variant of Recurrent Neural Networks (RNN), effectively resolves the long-term dependency issues encountered by traditional RNNs in processing lengthy sequences through its innovative introduction of reset and update gates to regulate information flow. As shown in Figure 6, the GRU architecture primarily consists of these two gating mechanisms, where the reset gate controls the retention proportion of historical state information while the update gate determines how much historical information should be preserved versus new information incorporated in the current state.
As shown in Figure 6, the outputs of the reset gate and update gate are denoted as r t and z t , respectively, with their computational formulas given by Equations (21) and (22):
$r_t = \sigma\!\left( W_r [h_{t-1}, x_t] + b_r \right)$  (21)
$z_t = \sigma\!\left( W_z [h_{t-1}, x_t] + b_z \right)$  (22)
where $W_r$ and $W_z$ are weight matrices and $b_r$ and $b_z$ are bias terms. Additionally, the GRU introduces a candidate hidden state $\tilde{h}_t$ at the current timestep to capture short-term dependencies, which is combined with the update gate $z_t$ to determine the final hidden state $h_t$ at the same timestep. The computational formulas for $\tilde{h}_t$ and $h_t$ are given by Equations (23) and (24):
$\tilde{h}_t = \tanh\!\left( W_{\tilde{h}} [r_t \odot h_{t-1}, x_t] + b_{\tilde{h}} \right)$  (23)
$h_t = z_t \odot \tilde{h}_t + (1 - z_t) \odot h_{t-1}$  (24)
By leveraging the selective memory and state update mechanisms of GRU, incorporating it into the policy network of the PPO algorithm can effectively enhance the land–air amphibious biomimetic robot’s ability to extract historical state information, thereby optimizing the policy update process and significantly improving the convergence speed of the algorithm. The architecture of the proposed IPPO algorithm’s policy network is illustrated in Figure 7. The network consists of three layers. The first layer serves as a feature extraction module, where the input state information is passed through a fully connected layer with 128 hidden neurons for linear transformation, followed by a ReLU activation function to introduce nonlinearity and facilitate preliminary feature extraction. The second layer is a GRU, which performs sequential modeling on the extracted features, dynamically updating the hidden state to preserve historical dependencies. The third layer is the decision-making layer, in which the GRU output is fed into two separate fully connected layers, each containing 64 hidden neurons, to produce the mean and standard deviation of a Gaussian distribution that represents the continuous action space. Finally, the robot samples actions from this distribution and calculates the corresponding log-probability, which is then used for policy optimization and gradient-based updates.
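A minimal PyTorch sketch of this three-layer architecture is given below; the layer widths follow the text, while everything else (bounding the mean with tanh, clamping the log standard deviation) is an illustrative assumption:

```python
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    def __init__(self, obs_dim=21, act_dim=3, hidden=128, head=64):
        super().__init__()
        # Layer 1: feature extraction (128 hidden neurons + ReLU).
        self.feature = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Layer 2: GRU for sequential modelling of the kinematic history.
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        # Layer 3: twin 64-neuron heads for the Gaussian mean and log-std.
        self.mu_head = nn.Sequential(nn.Linear(hidden, head), nn.ReLU(),
                                     nn.Linear(head, act_dim))
        self.log_std_head = nn.Sequential(nn.Linear(hidden, head), nn.ReLU(),
                                          nn.Linear(head, act_dim))

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, seq_len, obs_dim); h_n carries memory across calls.
        feats = self.feature(obs_seq)
        out, h_n = self.gru(feats, h0)
        mu = torch.tanh(self.mu_head(out))                  # bounded in [-1, 1]
        std = torch.exp(self.log_std_head(out).clamp(-5, 2))
        return torch.distributions.Normal(mu, std), h_n
```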

4.3. OU Random Noise

Efficient exploration is critical for finding global optimal paths in complex environments. However, the independent Gaussian noise typically used in standard PPO generates uncorrelated, jittery action signals. For a biomimetic robot, such high-frequency oscillations can damage mechanical structures and cause unstable flight attitudes. To mitigate this, we employ the Ornstein-Uhlenbeck (OU) process to generate temporally correlated noise. OU noise models the velocity of a massive Brownian particle under friction, producing smoother exploration trajectories that align better with the physical inertia of the robot, thereby allowing for safe exploration without abrupt control commands. The differential equation form of OU noise is:
$dx_t = -\theta (x_t - \mu)\, dt + \sigma\, dW_t$  (25)
where $x_t$ is the noise state and $W_t$ is a Wiener process; $\mu$ is the long-term mean toward which the process reverts, $\theta$ is the mean-reversion rate, and $\sigma$ scales the random diffusion. Through discretization, the update rule for OU noise is formulated as:
$x_{t+\Delta t} = x_t - \theta (x_t - \mu)\, \Delta t + \sigma \sqrt{\Delta t}\, \varepsilon_t$  (26)
where $\varepsilon_t$ is a random variable following the standard normal distribution.
The policy network architecture incorporating OU noise is illustrated in Figure 8. In this network, OU noise is additively applied to the action outputs, enabling the robot to generate smooth and coherent action sequences during exploration. This implementation not only enhances exploration efficiency but also significantly improves the stability and safety of path planning for the land–air amphibious biomimetic robot.
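A compact implementation of the discrete-time OU process of Equation (26), applied additively to the policy output as in Figure 8, might look as follows; the parameter values are illustrative:

```python
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise, Eq. (26)."""
    def __init__(self, dim=3, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu, dtype=np.float64)

    def sample(self) -> np.ndarray:
        eps = np.random.standard_normal(self.x.shape)
        self.x += -self.theta * (self.x - self.mu) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * eps
        return self.x.copy()

# Usage: perturb the sampled policy action with correlated noise, then
# clip back to the action bounds of Equation (12).
noise = OUNoise()
policy_action = np.zeros(3)              # placeholder for the policy's output
action = np.clip(policy_action + noise.sample(), -1.0, 1.0)
```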

4.4. Improved Value Network with Self-Attention Mechanism

In complex environments filled with obstacles, traditional value networks often treat all sensory inputs equally, failing to distinguish between critical threats and irrelevant background noise. This limitation can lead to decision biases, especially when the robot needs to identify narrow passageways. To enhance global perception, we incorporate a Multi-Head Self-Attention mechanism into the value network. This mechanism dynamically computes correlations between different state features and adaptively assigns higher weights to critical environmental information. By focusing on global structural dependencies rather than just local sensor readings, the network significantly improves the accuracy of state-value estimation and the robustness of the planned path.
The multi-head self-attention mechanism [28], an advanced extension of self-attention, employs multiple parallel attention heads to enable simultaneous extraction of diverse representations from heterogeneous feature subspaces. This architecture significantly strengthens the network’s feature extraction capability and flexibility, demonstrating superior performance when processing complex sequential data. The computational process is illustrated in Figure 9.
Assume that there are h attention heads, each with an independent query, key, and value. Each attention head can extract features from the input sequence from different perspectives and learn the association information of the data in different subspaces. Since the parameters of each attention head are independent, they can capture diverse features, thereby enhancing the representation ability of the model. The specific calculation steps of the multi-head self-attention mechanism are as follows:
  • Linear transformation: Assume the input sequence is $X = [x_1, \ldots, x_N] \in \mathbb{R}^{N \times D}$, where $N$ is the length of the input sequence and $D$ is the input feature dimension. For each attention head $i$, distinct linear transformation matrices $W_i^Q$, $W_i^K$, and $W_i^V$ produce the query matrix $Q_i$, key matrix $K_i$, and value matrix $V_i$. The calculation formula is shown in Formula (27):
$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V$  (27)
where $W_i^Q \in \mathbb{R}^{D \times D_q}$, $W_i^K \in \mathbb{R}^{D \times D_k}$, and $W_i^V \in \mathbb{R}^{D \times D_v}$.
  • Calculate the attention weights: For each attention head, the dot product between the query matrix $Q_i$ and the transpose of the key matrix $K_i$ is computed, scaled by the factor $\sqrt{D_k}$, and normalized by the softmax function to obtain the attention weight matrix $A_i$. The calculation formula is shown in Formula (28):
$A_i = \mathrm{softmax}\!\left( \dfrac{Q_i K_i^T}{\sqrt{D_k}} \right)$  (28)
  • Weighted summation: The attention weight matrix $A_i$ is multiplied by the value matrix $V_i$ to obtain the output $output_i$ of each head. The calculation formula is shown in Formula (29):
$output_i = A_i V_i$  (29)
  • Concatenation and fusion: The outputs of all $h$ attention heads are concatenated and then transformed through a linear projection matrix $W^O$ to produce the final output. The calculation formula is shown in Formula (30):
$output = \mathrm{Concat}(output_1, output_2, \ldots, output_h)\, W^O$  (30)
The value network structure of this paper, combined with the self-attention mechanism, is shown in Figure 10. First, the input state is subjected to preliminary feature extraction through a multilayer perceptron (MLP), and then nonlinearity is introduced through the ReLU activation function to enhance the model’s expressiveness. Next, the extracted features enter the multi-head self-attention mechanism layer, and the dependencies between different features are calculated in parallel through multiple attention heads to improve the network’s perception and robustness of global information. Finally, the features enhanced by the self-attention mechanism are input into the fully connected layer to output the state value. The improved value network can more accurately evaluate the value of the current state, accelerate convergence, and improve the overall performance and reliability of the algorithm.
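One plausible realization of this value network is sketched below, treating each scalar state feature as a token so that the attention heads can weigh dependencies between features (e.g., LiDAR returns versus the target offset); the embedding width and head count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionValueNet(nn.Module):
    def __init__(self, obs_dim=21, embed_dim=32, n_heads=4):
        super().__init__()
        # MLP feature extraction: each scalar feature becomes one token.
        self.embed = nn.Sequential(nn.Linear(1, embed_dim), nn.ReLU())
        # Multi-head self-attention implements Formulas (27)-(30) internally.
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        # Fully connected output layer producing the scalar state value.
        self.value_head = nn.Linear(obs_dim * embed_dim, 1)

    def forward(self, obs):                        # obs: (batch, obs_dim)
        tokens = self.embed(obs.unsqueeze(-1))     # (batch, obs_dim, embed_dim)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        return self.value_head(attn_out.flatten(1))  # V(s_t) per state
```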
To present the overall method framework more clearly, the algorithm pseudocode is shown in Algorithm 1:
Algorithm 1: Improved Proximal Policy Optimization (IPPO) for Amphibious Path Planning

5. Experimental Results and Analysis

5.1. Environment and Parameter Configuration

The simulation platform is implemented using the Simulation Open Framework Architecture (SOFA) framework. We selected SOFA over standard rigid-body simulators (e.g., Gazebo) due to its superior capability in handling multi-physics interactions. This feature is particularly crucial for our biomimetic robot, as it allows for accurate modeling of the complex contact mechanics between the robot’s wheel-leg structure and deformable terrains, as well as the aerodynamic disturbances encountered during mode switching. Based on this high-fidelity platform, we developed three simulation environments with varying levels of complexity (Env 1 to Env 3) and encapsulated them as standard Gym interfaces to facilitate autonomous algorithm training. Detailed configurations of the specific geometric features, obstacle distribution, and computational resources of the simulation environments are summarized in Table 1. As illustrated in Figure 11, the spatial dimensions of all maps are standardized to 50 m × 50 m × 20 m, where the red dot and the red five-pointed star represent the starting point and the target destination, respectively. The environmental complexity increases progressively from Env 1 to Env 3, providing a rigorous testbed to verify the effectiveness and generalization capabilities of the proposed algorithm across diverse scenarios. The algorithm-specific hyperparameters were initialized based on established literature [16,18,27,28] and further fine-tuned through preliminary trials in the simulated amphibious environment. The specific parameter values are detailed in Table 2. To ensure fair comparison, all baseline algorithms were trained under identical environmental conditions.
To quantitatively evaluate the reliability of the proposed algorithm, we established clear termination criteria for each experimental episode. An episode is recorded as a Success when the robot successfully navigates from the starting point to the vicinity of the target destination without any collisions. Conversely, an episode is marked as a Failure if the robot collides with an obstacle or fails to reach the target within a reasonable operational timeframe. Based on these criteria, the Success Rate, which serves as a primary performance metric in our comparative analysis, is defined as the percentage of successful trials out of the total number of evaluation episodes.

5.2. Ablation Experiment

In order to verify the impact of the GRU network, OU random noise and self-attention mechanism of the IPPO improvements on the algorithm performance, this section conducts an ablation experiment in env 3. Env 3 has high complexity and intricate obstacle distribution, which can fully test the performance of the algorithm model. The algorithm settings for experimental comparison are shown in Table 3. By comparing the combinations of different improvements (“✓” means included, and “✗” means not included), the contribution of each improved part to the algorithm performance is analyzed.
The results of the ablation experiment are shown in Figure 12 and Figure 13: Figure 12 presents the reward curves, and Figure 13 plots the success rate computed every 100 episodes. As the figures show, the traditional PPO algorithm converges most slowly, its final reward value stabilizes at about 110, and its success rate is only 70%, indicating insufficient strategy optimization efficiency in complex environments.
The PPO_GRU algorithm effectively captures long-term dependencies in the state sequence by introducing the GRU network; its final reward value stabilizes at about 130 and its success rate reaches 75%, a clear improvement over PPO. The PPO_OU algorithm improves exploration by introducing OU noise; its final reward value is about 120 and its success rate 72%, slightly below PPO_GRU but better than PPO, showing that OU noise plays a positive yet comparatively limited role in exploration. The PPO_ATTEN algorithm achieves a reasonable distribution of global state weights through the self-attention mechanism, significantly improving the directionality of strategy optimization; its final reward value reaches about 130, and its success rate reaches 70% within 1650 episodes before stabilizing at 80%, demonstrating high task completion efficiency. The IPPO algorithm performs best: its reward grows rapidly within the first 500 episodes, its final reward value stabilizes at 150, and its success rate reaches 85%. In summary, the GRU network, OU noise, and self-attention mechanism each improve algorithm performance to varying degrees, verifying the effectiveness of the proposed improvements.

5.3. Controlled Experiment

In the previous section, the effectiveness of the IPPO improvement strategies was verified through ablation experiments. To further verify the advantages of the proposed IPPO algorithm for path planning of land–air amphibious biomimetic robots, this section presents comparative experiments against two established deep reinforcement learning algorithms, PPO and DDPG. DDPG is an off-policy algorithm for continuous control that simultaneously learns a Q-function and a policy. These baselines represent different categories of RL algorithms: PPO is a classic on-policy method, while DDPG is a representative off-policy method. The experiments were conducted in env 1, env 2, and env 3, respectively. Each algorithm was run independently 25 times in each environment, and the averages of its performance indicators were computed to ensure stable and reliable results. By comparing the performance of the three algorithms across environments, the effectiveness, generalization, and robustness of the IPPO algorithm in path planning tasks are comprehensively evaluated. The path planning results in the different environments are shown in Figure 14, Figure 15 and Figure 16, respectively. In these visualizations, the ground mode corresponds to low-altitude segments for energy saving, while the flight mode is characterized by vertical ascents triggered specifically for obstacle avoidance.
In this experiment, to simplify the complexity of the model, the amphibious robot is regarded as a point mass. Since the robot itself has a certain height, a height threshold is set in the experiment: when the flight height is higher than 0.2 m, it is considered to be in flight state, and when it is lower than 0.2 m, it is considered to be in ground driving state. At the same time, the flight energy consumption and ground driving energy consumption are set to 270.3 J/m and 46.4 J/m, respectively.
Table 4 records the performance indicators of each algorithm in the three environments, including average path length, average flight path length, average ground path length, and average energy consumption. As Figure 14, Figure 15 and Figure 16 and Table 4 show, the IPPO algorithm has significant advantages in the path planning task of land–air amphibious biomimetic robots, especially in energy consumption optimization and environmental adaptability. As environmental complexity increases, IPPO better balances path length against energy consumption, showing good robustness and generalization ability. In env 1 and env 2, the path lengths of the PPO algorithm are 76.041 m and 72.087 m, respectively, lower than those of IPPO, but its energy consumption is much higher. This indicates that IPPO generates the most energy-efficient trajectories, effectively minimizing redundant maneuvers and loitering in complex terrain. This efficiency is largely attributable to the temporal feature extraction of the GRU module, which smooths the control signals during mode transitions and reduces oscillatory behavior. The path length of DDPG is lower than IPPO's in env 1 but higher in env 2, indicating that DDPG's stability degrades as environmental complexity increases; its energy consumption in both environments also exceeds IPPO's. In contrast, although the paths of IPPO are slightly longer, it effectively reduces energy consumption by shortening the flight segments and increasing ground travel, consuming 9.204 kJ and 12.661 kJ in env 1 and env 2, respectively, significantly better than the other two algorithms. In env 3, as complexity increases further, IPPO remains robust and adaptive: its path length is 75.181 m, close to PPO and DDPG, but its energy consumption is only 9.881 kJ, far below DDPG's 18.778 kJ and PPO's 17.416 kJ. In summary, the IPPO algorithm performs well in the path planning task of land–air amphibious biomimetic robots: it effectively reduces energy consumption without significantly increasing path length, fully exploiting the advantages of the amphibious platform, and it remains stable as environmental complexity grows, further demonstrating its generalization and reliability.
To further evaluate the performance of the IPPO algorithm, this section statistically analyzes the average rewards and success rates during the experiment. The results are shown in Table 5 and visualized as shown in Figure 17.
As can be seen from Table 5 and Figure 17, the average rewards and success rates of the IPPO algorithm in the different environments are significantly better than those of DDPG and PPO, further verifying its superior performance in the path planning task of land–air amphibious biomimetic robots. In terms of average reward, IPPO attains 162.2, 154.1, and 152.3 in env 1, env 2, and env 3, respectively, significantly higher than DDPG and PPO; across the three environments, IPPO's average reward is about 25% higher, indicating that the algorithm effectively improves the quality of task completion. In terms of success rate, IPPO also performs outstandingly, achieving 93.0%, 89.0%, and 85.7% in env 1, env 2, and env 3, respectively, higher than both DDPG and PPO. In addition, as Figure 17 shows, the average reward and success rate of DDPG and PPO both decline as environmental complexity increases, whereas IPPO remains relatively stable, further verifying its environmental adaptability and generalization ability. In summary, the IPPO algorithm shows significant advantages in the path planning task of land–air amphibious biomimetic robots, effectively improving the quality and success rate of task completion while maintaining strong adaptability and stability in complex environments.
It is important to note that the quantitative results presented in Table 5 are based on average performance metrics. While rigorous statistical tests (e.g., t-tests or ANOVA) were not conducted due to the computational constraints of large-scale reinforcement learning training, the magnitude of improvement observed in the proposed IPPO is substantial. For instance, the Success Rate surpasses the baseline DDPG by approximately 18%, and Energy Consumption is reduced by a significant margin. These large performance gaps suggest that the superiority of the IPPO framework is driven by the structural advantages of the GRU and Attention mechanisms rather than stochastic variance.

6. Conclusions

In this paper, we proposed a global path planning framework based on Improved Proximal Policy Optimization (IPPO), specifically tailored for land–air amphibious biomimetic robots to address the challenges of kinematic discontinuities and complex environmental interference. The proposed approach integrates a Gated Recurrent Unit (GRU), Ornstein–Uhlenbeck (OU) noise, and a Self-Attention mechanism into the standard PPO architecture to enhance robustness and adaptability.
The experimental results indicate that the integration of the GRU module successfully enables the agent to capture temporal dependencies, thereby reducing trajectory oscillations during mode switching and ensuring smoother transitions compared to memory-less baselines. Furthermore, the employment of OU noise facilitates temporally correlated exploration; unlike standard Gaussian noise, this strategy generates physically feasible control signals that mitigate the risk of mechanical damage associated with high-frequency jitter while improving convergence speed. Crucially, the Self-Attention mechanism effectively enhances global perception, allowing the agent to optimize the trade-off between trajectory length and energy efficiency. Quantitative comparisons based on average performance metrics demonstrate that the IPPO algorithm achieves a higher success rate and significantly lower energy consumption than standard PPO and DDPG baselines.
Despite these promising results, this study is subject to certain limitations, particularly the reliance on average metrics due to computational constraints and the exclusive use of a simulation environment. To address these challenges and advance the field, future research will focus on bridging the ‘Sim-to-Real’ gap by employing Domain Randomization techniques. This approach aims to train policies robust to real-world uncertainties, such as sensor noise and variable terrain friction, thereby facilitating the transfer of the IPPO algorithm to physical amphibious prototypes. Furthermore, we intend to extend the framework’s adaptability to dynamic environments by incorporating force-sensing feedback to mitigate unpredictable aerodynamic disturbances, such as wind gusts. Finally, integrating multi-modal sensor fusion, specifically by combining visual data from depth cameras with current state inputs, will be explored to enhance local perception fidelity, enabling more precise obstacle avoidance and autonomous landing site selection in complex, unstructured terrains.

Author Contributions

W.J. and J.L. designed the research methods and wrote the main body of the paper, while W.W. and Y.W. assisted in writing the entire paper. All authors contributed to implementing the software and conceiving the experiments. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Guangdong S&T Programme with grant No. 2024B0101030002, the National Natural Science Foundation of China under Grant 62473138, 62173132 and 62573186, the project of Yuelushan Center for Industrial Innovation 2025YCII0107, the Project of Natural Science Foundation Youth Enhancement Program of Guangdong Province under Grant 2024A1515030184, the science and technology innovation Program of Hunan Province 2025RC3075, and the Project of Guangzhou City Zengcheng District Key Research and Development under Grant 2024ZCKJ01.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We sincerely appreciate the constructive comments and suggestions of the anonymous reviewers, which have greatly helped to improve this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PPO  Proximal Policy Optimization
IPPO  Improved Proximal Policy Optimization
GRU  Gated Recurrent Unit

References

  1. Kim, C.; Lee, K.; Ryu, S.; Seo, T. Amphibious robot with self-rotating paddle-wheel mechanism. IEEE/ASME Trans. Mechatron. 2023, 28, 1836–1843. [Google Scholar] [CrossRef]
  2. Zhang, H.; Zhu, Y.; Yang, J.; Zhao, J. A bioinspired paddle-wheeled leg amphibious robot with environment-adaptive autonomously. IEEE/ASME Trans. Mechatron. 2024, 30, 15–26. [Google Scholar] [CrossRef]
  3. Chen, L.; Cui, R.; Yan, W.; Xu, H.; Zhang, S.; Yu, H. Terrain-adaptive locomotion control for an underwater hexapod robot: Sensing leg–terrain interaction with proprioceptive sensors. IEEE Robot. Autom. Mag. 2023, 31, 41–52. [Google Scholar] [CrossRef]
  4. Speciale, C.; Milana, S.; Carcaterra, A.; Concilio, A. A Review of Bio-Inspired Perching Mechanisms for Flapping-Wing Robots. Biomimetics 2025, 10, 666. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, D.; Xu, M.; Zhu, P.; Guo, C.; Zhong, Z.; Lu, H.; Zheng, Z. The development of a novel terrestrial/aerial robot: Autonomous quadrotor tilting hybrid robot. Robotica 2024, 42, 118–138. [Google Scholar] [CrossRef]
  6. Alexander, A.; Venkatesan, K.; Mounsef, J.; Ramanujam, K. A Comprehensive Survey of Path Planning Algorithms for Autonomous Systems and Mobile Robots: Traditional and Modern Approaches. IEEE Access 2025, 13, 176287–176326. [Google Scholar] [CrossRef]
  7. Qin, H.; Shao, S.; Wang, T.; Yu, X.; Jiang, Y.; Cao, Z. Review of autonomous path planning algorithms for mobile robots. Drones 2023, 7, 211. [Google Scholar] [CrossRef]
  8. Fu, X.; Huang, Z.; Zhang, G.; Wang, W.; Wang, J. Research on path planning of mobile robots based on improved A* algorithm. PeerJ Comput. Sci. 2025, 11, e2691. [Google Scholar] [CrossRef] [PubMed]
  9. Wang, X.; Xu, H.; Miao, P.; Petrovic, B.; Rodic, A.; Wang, Z. Path Planning for Mobile Robots Based on Improved A* Algorithm. In Proceedings of the International Conference on Robotics, Automation and Intelligent Control (ICRAIC), Zhangjiajie, China, 22–24 December 2023; pp. 382–387. [Google Scholar]
  10. He, X.; Zhou, Y.; Liu, H.; Shang, W. Improved RRT*-Connect Manipulator Path Planning in a Multi-Obstacle Narrow Environment. Sensors 2025, 25, 2364. [Google Scholar] [CrossRef] [PubMed]
  11. Yan, P.; Yan, Z.; Zheng, H.; Guo, J. Real Time Robot Path Planning Method Based on Improved Artificial Potential Field Method. In Proceedings of the Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 4814–4820. [Google Scholar]
  12. Li, Y.; Zhu, Q. Local path planning based on improved Dynamic window approach. In Proceedings of the Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 4291–4295. [Google Scholar]
  13. Xu, X.; Zeng, J.; Zhao, Y.; Lü, X. Research on global path planning algorithm for mobile robots based on improved A*. Expert Syst. Appl. 2024, 243, 122922. [Google Scholar] [CrossRef]
  14. Bai, Z.; Pang, H.; He, Z.; Zhao, B.; Wang, T. Path Planning of Autonomous Mobile Robot in Comprehensive Unknown Environment Using Deep Reinforcement Learning. IEEE Internet Things J. 2024, 11, 22153–22166. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Chen, P. Path Planning of a Mobile Robot for a Dynamic Indoor Environment Based on an SAC-LSTM Algorithm. Sensors 2023, 23, 9802. [Google Scholar] [CrossRef] [PubMed]
  16. Nie, J.; Zhang, G.; Lu, X.; Wang, H.; Sheng, C.; Sun, L. Reinforcement learning method based on sample regularization and adaptive learning rate for AGV path planning. Neurocomputing 2025, 614, 128820. [Google Scholar] [CrossRef]
  17. Fei, W.; Xiaoping, Z.; Zhou, Z.; Yang, T. Deep-reinforcement-learning-based UAV autonomous navigation and collision avoidance in unknown environments. Chin. J. Aeronaut. 2024, 37, 237–257. [Google Scholar]
  18. Qi, C.; Wu, C.; Lei, L.; Li, X.; Cong, P. UAV path planning based on the improved PPO algorithm. In Proceedings of the Asia Conference on Advanced Robotics, Automation, and Control Engineering (ARACE), Qingdao, China, 26–28 August 2022. [Google Scholar]
  19. Chen, S.; Mo, Y.; Wu, X.; Xiao, J.; Liu, Q. Reinforcement Learning-Based Energy-Saving Path Planning for UAVs in Turbulent Wind. Electronics 2024, 13, 3190. [Google Scholar] [CrossRef]
  20. Cao, X.; Zhang, J.; Xiang, Y.; Yan, Z. SAC-based path planning for amphibious UAVs: A maximum entropy deep reinforcement learning approach. In Proceedings of the International Conference on Image Processing, Intelligent Control and Computer Engineering, Hangzhou, China, 19–25 October 2025. [Google Scholar]
  21. Mondal, M.S.; Ramasamy, S.; Pranav, A. An Attention-Aware Deep Reinforcement Learning Framework for Cooperative UAV-UGV Routing. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 13687–13693. [Google Scholar]
  22. Zhou, X.; Zhong, H.; Zhang, H.; He, W.; Hua, H.; Wang, Y. Current Status, Challenges, and Prospects for New Types of Aerial Robots (NTARs). Engineering 2024, 41, 19–34. [Google Scholar] [CrossRef]
  23. Mandralis, I.; Nemovi, R.; Ramezani, A.; Murray, R.M.; Gharib, M. ATMO: An aerially transforming morphobot for dynamic ground-aerial transition. Commun. Eng. 2025, 4, 74. [Google Scholar] [CrossRef] [PubMed]
  24. Fang, F.; Zhou, J.; Zhang, Y.; Yi, Y.; Huang, Z.; Feng, Y.; Tao, K.; Li, W.; Zhang, W. A Multimodal Amphibious Robot Driven by Soft Electrohydraulic Flippers. Cyborg Bionic Syst. 2025, 6, 0253. [Google Scholar] [CrossRef] [PubMed]
  25. Chen, L.; Xiao, J.; Teo, C.W.R.; Li, J.; Feroskhan, M. Air–Ground Collaborative Control for Angle-Specified Heterogeneous Formations. IEEE Trans. Intell. Veh. 2025, 10, 1483–1497. [Google Scholar] [CrossRef]
  26. Liang, D.; Huang, X.; Xue, Z.; Li, P. Path planning for amphibious unmanned ground vehicles under cross-domain constraints. Intel. Serv. Robot. 2025, 18, 1381–1416. [Google Scholar] [CrossRef]
  27. Singh, B.; Patel, S.; Vijayvargiya, A.; Kumar, R. Data-driven gait model for bipedal locomotion over continuous changing speeds and inclines. Auton. Robot. 2023, 47, 753–769. [Google Scholar] [CrossRef]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Figure 1. Appearance and hardware composition of the land–air amphibious biomimetic robot.
Figure 2. Modes of the land–air amphibious biomimetic robot.
Figure 3. PPO algorithm framework diagram.
Figure 4. Objective function constraint range. The red dots represent examples where the probability ratio rt(θ) falls within the unclipped interval (i.e., the update is not clipped by ε), indicating that the standard gradient update is active.
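For reference, the quantity Figure 4 visualizes is the standard PPO clipped surrogate objective, where rt(θ) is the policy probability ratio and ε the clipping range (0.2 in Table 2):

```latex
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```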
Figure 5. The overall framework of the proposed IPPO algorithm. The system comprises Environment Interaction, the Agent, and Learning modules. The Agent integrates a GRU for kinematic stability, Multi-Head Self-Attention for global perception, and OU noise for smooth exploration. Interactions are stored in the Replay Buffer to drive iterative PPO updates.
Figure 6. GRU network architecture.
Figure 7. Policy Network Architecture of IPPO. This shows a three-layer network: feature extraction with ReLU, GRU for sequential modeling, and outputs for action mean and deviation. It highlights GRU’s role in optimizing continuous actions, backing the enhanced convergence speed in complex environments.
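To make the Figure 7 architecture concrete, the following is a minimal PyTorch sketch, assuming illustrative dimensions (state_dim, hidden_dim, action_dim) and a learnable log-standard-deviation head; it is not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    def __init__(self, state_dim=16, hidden_dim=128, action_dim=4):
        super().__init__()
        # MLP feature extractor with ReLU, as in the caption.
        self.features = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        # GRU captures temporal dependencies across the state sequence.
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # One common choice: log standard deviation as a learnable parameter.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, states, hidden=None):
        # states: (batch, seq_len, state_dim); hidden carries memory forward.
        x = self.features(states)
        out, hidden = self.gru(x, hidden)
        mean = self.mean_head(out)
        std = self.log_std.exp().expand_as(mean)
        return mean, std, hidden

policy = GRUPolicy()
mean, std, h = policy(torch.randn(2, 5, 16))  # a 5-step state sequence
```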
Figure 8. OU Noise Exploration Strategy.
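The OU exploration strategy of Figure 8 can be sketched with the standard Euler–Maruyama discretization of the Ornstein–Uhlenbeck process; the parameters theta, sigma, and dt below are common defaults, not the tuned values from this work.

```python
import numpy as np

class OUNoise:
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu)

    def sample(self):
        # dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, I): mean-reverting,
        # temporally correlated noise suited to inertial robot dynamics.
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + dx
        return self.x

noise = OUNoise(dim=4)
perturbation = noise.sample()  # added to the policy's action mean at each step
```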
Figure 9. Computation process of multi-head self-attention. Diagram outlines linear transformations, attention weight calculation, weighted summation, and concatenation for multiple heads. It explains dynamic feature correlation, supporting better global perception in cluttered environments.
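The computation steps in Figure 9 correspond to the following minimal PyTorch sketch (embed_dim and num_heads are illustrative assumptions, not the paper's settings): linear Q/K/V projections, scaled dot-product attention weights, weighted summation, and concatenation of heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=64, num_heads=4):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.h, self.d = num_heads, embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # linear transformations
        self.out = nn.Linear(embed_dim, embed_dim)      # projection after concat

    def forward(self, x):
        b, n, e = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, tokens, head_dim).
        q, k, v = (t.view(b, n, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        weights = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        context = weights @ v                                # weighted summation
        context = context.transpose(1, 2).reshape(b, n, e)   # concatenate heads
        return self.out(context)

attn = MultiHeadSelfAttention()
y = attn(torch.randn(2, 10, 64))  # ten state-feature tokens per sample
```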
Figure 10. Value Network Architecture of IPPO. This depicts input through MLP with ReLU, multi-head self-attention, and final state value output. It shows how attention enhances state evaluation, supporting improved robustness and accuracy in path planning.
Figure 11. 3D simulation environment. The image displays three maps (Env 1–3) of 50 m × 50 m × 20 m with increasing obstacle complexity, with the start (red dot) and target (red star) marked. It visualizes the testbed, supporting the algorithm’s generalization across varying environmental challenges.
Figure 12. Average cumulative reward curve during training. The x-axis represents the number of training episodes, and the y-axis indicates the total dimensionless reward score. The curve is smoothed using a sliding window to highlight the overall convergence trend.
Figure 13. Success rate evaluation curve. The y-axis displays the success rate (ranging from 0 to 1), calculated as the ratio of successful arrivals to total attempts. The x-axis denotes the training progression in episodes.
Figure 14. Path planning results in Env 1. Trajectories of IPPO, PPO, and DDPG in a simple environment, with mode switches indicated. IPPO’s path balances length and energy, supporting lower consumption via more ground travel.
Figure 15. Path planning results in Env 2. Paths in the medium-complexity environment, showing IPPO’s efficient mode use. It achieves shorter flight segments, backing claims of energy optimization and adaptability as complexity increases.
Figure 16. Path planning results in Env 3. Trajectories in the high-complexity environment, with IPPO maintaining low energy consumption despite dense obstacles. This illustrates the algorithm’s robustness as environmental complexity increases.
Figure 17. Visual comparison of average rewards and success rates. Bar charts of rewards and success rates across the three environments. IPPO leads in all cases (e.g., 93% success in Env 1), confirming stable high performance and environmental adaptability as complexity rises.
Table 1. Comprehensive summary of simulation environment configuration.
Category | Parameters | Value/Description
Simulation Platform | Physics Engine | SOFA
Global Environment | Map Dimensions | 50 m × 50 m × 20 m
Global Environment | Obstacle Distribution | Spatially dispersed, locally clustered
Global Environment | Obstacle Density | 3–5%
Global Environment | Obstacle Geometry | Cylinders: radius 0.5 m–2.0 m; cuboids: edge length 1.0 m–3.0 m; walls and irregular shapes
Computational Setup | Hardware | CPU: Intel Core i7-14700KF; GPU: NVIDIA GeForce RTX 4060 Ti; RAM: 32 GB
Computational Setup | Software Stack | System: Ubuntu 20.04; Frameworks: Python 3.8, PyTorch 1.7.0
Table 2. Hyperparameter settings of the IPPO algorithm.
Parameters | Value
Training rounds | 4100
Reward discount factor | 0.995
Learning rate | 0.00025
Clipping range | 0.2
GAE factor | 0.95
Batch size | 256
Maximum steps per episode | 800
Optimizer | Adam
(η, λ, δ, μ) | (1.0, 5.0, 0.2, 100)
(k, ξ, β, σ) | (0.1, 0.1, 50, 0.5)
Table 3. Ablation experiment configuration (✓ = module enabled).
Methods | GRU | OU | Self-Attention
PPO | – | – | –
PPO_ATTEN | – | – | ✓
PPO_OU | – | ✓ | –
PPO_GRU | ✓ | – | –
IPPO (Ours) | ✓ | ✓ | ✓
Table 4. Comparative experimental results under three environments.
Env | Algorithm | Average Path Length/m | Average Flight Path Length/m | Average Ground Path Length/m | Average Energy Consumption/kJ
Env 1 | DDPG | 74.049 | 69.748 | 4.301 | 19.052
Env 1 | PPO | 76.041 | 72.536 | 3.505 | 19.769
Env 1 | IPPO (Ours) | 79.164 | 24.702 | 54.462 | 9.204
Env 2 | DDPG | 83.927 | 80.974 | 2.953 | 22.024
Env 2 | PPO | 72.087 | 69.675 | 2.412 | 18.945
Env 2 | IPPO (Ours) | 81.280 | 39.705 | 41.575 | 12.661
Env 3 | DDPG | 71.816 | 68.985 | 2.831 | 18.778
Env 3 | PPO | 73.086 | 62.639 | 10.447 | 17.416
Env 3 | IPPO (Ours) | 75.181 | 28.550 | 46.631 | 9.881
Table 4 details the quantitative performance of the three algorithms across the mixed environments. While several kinematic metrics are reported for completeness, the most critical differentiator is average energy consumption: by shifting much of the route from flight to ground locomotion, IPPO reduces energy use by roughly 33–53% relative to the PPO and DDPG baselines, at the cost of somewhat longer total paths.
Table 5. Average rewards and success rates in different environments.
Env | Algorithm | Average Rewards | Average Success Rates/%
Env 1 | DDPG | 122.4 | 82.6
Env 1 | PPO | 120.6 | 84.0
Env 1 | IPPO (Ours) | 162.2 | 93.0
Env 2 | DDPG | 112.6 | 70.5
Env 2 | PPO | 117.5 | 73.3
Env 2 | IPPO (Ours) | 154.1 | 89.0
Env 3 | DDPG | 103.6 | 66.0
Env 3 | PPO | 113.2 | 70.8
Env 3 | IPPO (Ours) | 152.3 | 85.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
