1. Introduction
Autonomous tracked amphibious robotic systems capable of operating seamlessly across water and land environments play an increasingly important role in coastal inspection [1], environmental monitoring [2], disaster rescue [3], and maritime transportation applications [4]. Compared with single-medium robotic systems, amphibious platforms provide superior mission flexibility and accessibility in complex terrains where water and land coexist [5]. However, enabling robots to autonomously navigate across heterogeneous environments remains a fundamental challenge [6], as water–land transitions involve discontinuous dynamics [7], rapidly changing environmental constraints [8], and safety-critical interactions with uncertain surroundings [9].
Recent advances in learning-based robotic navigation have demonstrated remarkable success in single-domain path planning and obstacle avoidance for unmanned surface vehicles and ground robots [10]. Nevertheless, most existing approaches are designed for either water or land environments independently [11], and their policies often fail when directly transferred across domains due to inconsistent state representations, abrupt medium switching, and unmodeled physical constraints [12]. Consequently, current methods suffer from unstable transition decisions near shorelines, oscillatory behaviors during medium switching, and elevated collision risks, which significantly limit the real-world deployment of amphibious robotic systems.
To address these limitations, this study investigates the following research question: How can an autonomous robot achieve safe, stable, and efficient navigation across discontinuous water–land environments under environmental uncertainty? We hypothesize that explicitly modeling cross-domain reachability, hierarchical switching decisions, and safety-constrained control is essential to achieve robust amphibious navigation.
The objective of this work is to develop a framework that integrates global cross-domain planning, medium-switching decision-making, and safety-aware continuous control into a coherent joint optimization scheme. However, solving this problem involves several critical challenges. First, water and land environments exhibit fundamentally different dynamic constraints, making it difficult to construct unified environmental representations for global planning. Second, naive policy structures struggle to produce stable medium-switching decisions near boundary regions, leading to frequent oscillations and control instability. Third, safety-critical constraints during shoreline interactions require explicit collision and grounding avoidance mechanisms beyond standard learning formulations.
Motivated by these challenges, we propose CD-HSSRL, a Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework for autonomous amphibious navigation. The proposed framework introduces a Cross-Domain Global Reachability Planner to construct unified cost-aware environmental representations, enabling consistent long-horizon planning across water and land. A Hierarchical Safe Switching Policy is designed to decompose navigation into high-level medium-switching decisions and low-level motion control, enforcing switching stability through regularized option learning. Furthermore, a Safety-Constrained Continuous Controller integrates action safety projection and risk-sensitive reward shaping to guarantee collision-free and stable control during complex water–land transitions. These modules are jointly optimized to achieve unified planning–switching–safety co-optimization for robust cross-domain navigation.
The main contributions of this paper are summarized as follows: (1) We propose a novel Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework that unifies water–land navigation, medium-switching decision-making, and safety-critical control into a jointly optimized hierarchical architecture. (2) We develop a Cross-Domain Global Reachability Planner and a Hierarchical Safe Switching Policy that enable stable and robust amphibious navigation under discontinuous environmental dynamics. (3) We design a Safety-Constrained Continuous Controller that explicitly enforces physical safety constraints during shoreline interaction. (4) Extensive experiments on multiple water-domain, land-domain, and cross-domain benchmarks demonstrate that the proposed method achieves competitive performance compared with baselines in navigation success rate, transition stability, and collision avoidance performance.
2. Related Work
2.1. Amphibious and Cross-Domain Robot Navigation
Amphibious and cross-domain robotic systems have emerged as an important research topic due to their ability to operate in heterogeneous environments where water and land coexist [13,14]. Typical application scenarios include coastal inspection, flood rescue [15], ecological monitoring, and military reconnaissance [16]. Compared with single-medium robotic platforms, amphibious robots offer greater mission flexibility but also face fundamentally different environmental constraints when transitioning between water and land. Early studies on amphibious navigation mainly relied on model-based planning frameworks [17], such as graph search, sampling-based path planning, and optimization-based trajectory generation. These approaches usually construct separate environmental models for water and land and design heuristic cost functions to guide navigation. While such methods achieve acceptable performance in structured environments, they heavily depend on accurate dynamic modeling and manually designed traversability maps, which limits their adaptability in complex and uncertain real-world scenes.
More recent research has introduced learning-based strategies to enhance amphibious navigation performance [18]. Learning-based planners can capture nonlinear hydrodynamic effects and terrain interactions more effectively than traditional analytical models [19]. However, most existing works still treat water-domain and land-domain planning as two loosely coupled processes, and medium transitions are often handled by handcrafted switching logic or predefined thresholds. This separation leads to inconsistent decision-making near shoreline boundaries, unstable transitions, and reduced robustness under environmental disturbances. Therefore, a unified planning and decision-making framework capable of representing heterogeneous environments and enabling smooth cross-domain transitions remains an open research challenge.
2.2. Reinforcement Learning for Water-Domain and Land-Domain Navigation
Reinforcement learning has become a powerful tool for autonomous navigation in both water and land environments [20,21]. Among these methods, widely adopted algorithms such as Proximal Policy Optimization (PPO) and Soft Actor–Critic (SAC) have demonstrated strong performance in continuous control problems. In water-domain navigation, deep reinforcement learning has been widely applied to amphibious and surface robotic platforms for dynamic obstacle avoidance, collision-free control, and energy-efficient path planning. Li et al. [22] developed an RL-based path planning framework for autonomous underwater vehicles (AUVs), showing improved adaptability in dynamic ocean environments. Mou et al. [23] proposed a reinforcement learning-based navigation strategy for unmanned surface vehicles, emphasizing safety and trajectory optimization. These methods benefit from end-to-end policy learning and can adapt to complex maritime environments without explicit hydrodynamic modeling. Similarly, in land-domain navigation, reinforcement learning has demonstrated strong capabilities in mobile robot path planning, multi-agent coordination, and navigation in dynamic scenes. Tao et al. [24] proposed an algorithm called Adaptive Soft Actor–Critic (ASAC), which combines the SAC algorithm, tile coding, and the Dynamic Window Approach (DWA) to enhance path planning capabilities. Such approaches allow robots to learn reactive and anticipatory behaviors directly from environmental interactions.
Despite these advances, most existing reinforcement learning methods are developed and trained in a single domain. Policies learned in water environments usually fail when directly transferred to land environments, and vice versa, due to inconsistent observation distributions, abrupt changes in motion dynamics, and different safety constraints. As a result, existing single-domain reinforcement learning frameworks struggle to generalize across heterogeneous environments and cannot guarantee stable decision-making during water–land transitions. This limitation highlights the necessity of developing cross-domain reinforcement learning frameworks that explicitly model medium-dependent dynamics and enable knowledge sharing between water and land navigation policies.
2.3. Hierarchical Reinforcement Learning and Medium-Switching Decision Making
Hierarchical reinforcement learning decomposes complex decision-making tasks into multiple temporal or functional layers [25], typically including a high-level planner and a low-level controller. Such hierarchical structures improve learning efficiency, interpretability, and scalability, especially in long-horizon robotic navigation problems. Option-based frameworks further enable the learning of temporally extended actions, allowing robots to switch between different behavioral modes based on environmental contexts.
Hierarchical reinforcement learning has been successfully applied to task decomposition, navigation subtasks, and skill sequencing in robotics. However, existing hierarchical approaches primarily focus on abstract task or goal decomposition and seldom consider physical medium-switching in real-world robotic systems. In amphibious navigation, medium-switching decisions correspond to physically distinct motion regimes, such as floating, shoreline climbing, and ground driving. Without explicit switching stability modeling, hierarchical policies often generate oscillatory decisions near transition regions, resulting in inefficient control and increased risk of grounding or collision. This reveals the need for a hierarchical reinforcement learning framework specifically designed to handle physical medium transitions and enforce stable switching behavior in cross-domain navigation.
Recent hierarchical navigation frameworks such as predictive hierarchical deep reinforcement learning (pH-DRL [26]) and motion-primitive-based deep Q-learning (MP-DQL [27]) further demonstrate the effectiveness of long-horizon decision decomposition and structured action spaces in robotic planning. However, these approaches are mainly designed for single-domain navigation and do not explicitly model physical medium transitions or switching stability in amphibious systems.
2.4. Safe Reinforcement Learning and Constraint-Aware Robotic Control
Safety is a critical requirement for autonomous robots operating in real environments. To address safety concerns, safe reinforcement learning techniques have been proposed to incorporate physical constraints and risk-awareness into policy learning. Common strategies include action projection layers that filter unsafe control commands, constraint-aware optimization objectives, and risk-sensitive reward formulations that penalize unsafe behaviors. These methods effectively improve collision avoidance and system robustness in single-domain navigation tasks.
However, most existing safe reinforcement learning frameworks focus on either water-domain or land-domain safety constraints independently. In cross-domain amphibious navigation, safety risks are amplified during medium transitions, such as shoreline climbing, water entry, and obstacle interaction at boundary regions. Existing safe control strategies are rarely integrated with medium-switching decision policies or global cross-domain planners, leading to fragmented safety handling mechanisms. Consequently, guaranteeing safety throughout the entire water–land transition process remains challenging. This motivates the development of a framework that jointly considers safety constraints, medium-switching decisions, and cross-domain planning in a unified learning architecture.
Representative safety-filtering frameworks such as BarrierNet [28] introduce differentiable control barrier functions to enforce safety constraints during policy execution. While these methods provide strong collision avoidance guarantees, they are not integrated with hierarchical medium-switching decision policies or cross-domain global planners, limiting their applicability in discontinuous water–land navigation.
Some recent studies have focused on sim-to-real transfer in reinforcement learning, aiming to bridge the gap between simulation and real-world deployment. These approaches typically employ domain randomization, adaptation learning, or hierarchical training strategies [29,30]. However, most existing sim-to-real methods assume a single-domain setting with consistent dynamics, and they do not explicitly address cross-domain navigation involving discontinuous transitions, such as water–land scenarios.
Overall, prior research has made significant progress in amphibious navigation, single-domain reinforcement learning, hierarchical decision-making, and safe control. Nevertheless, a unified approach that simultaneously addresses cross-domain environmental representation, stable medium-switching decisions, and safety-constrained continuous control is still lacking. To bridge this gap, this paper proposes CD-HSSRL, a Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework that integrates global reachability planning, hierarchical switching policies, and safety-constrained control to achieve robust autonomous amphibious navigation.
3. Method
In this section, we present the proposed Cross-Domain Hierarchical Safe Switching Reinforcement Learning (CD-HSSRL) framework for autonomous navigation and path planning of amphibious robots across water–land environments. We first formulate the cross-domain navigation problem and then introduce the overall hierarchical architecture, followed by the detailed design of each functional module.
3.1. Problem Formulation
We consider an amphibious robot operating in a mixed water–land environment. The environment is represented by a cross-domain state space $\mathcal{S} = \mathcal{S}_w \cup \mathcal{S}_l \cup \mathcal{S}_t$, where $\mathcal{S}_w$, $\mathcal{S}_l$, and $\mathcal{S}_t$ denote water, land, and transition regions, respectively. The robot dynamics vary across domains, leading to discontinuous motion models.
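The navigation objective is to find a policy $\pi$ that drives the robot from a start state $s_0$ to a goal state $s_g$ while minimizing cumulative cost and satisfying safety constraints. This problem is formulated as a constrained Markov decision process (CMDP):
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, C, \gamma),$$
where $\mathcal{A}$ is the action space, $P$ is the transition probability, $R$ is the reward function, $C$ denotes constraint cost, and $\gamma$ is the discount factor.
The optimization objective is
$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]$$
subject to the safety constraint
$$\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma_c^{t} C(s_t, a_t)\right] \le d,$$
where $d$ is a predefined safety threshold limiting collision, grounding, or rule-violation risks, and $\gamma_c$ denotes the discount factor for accumulated safety costs.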
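As a minimal illustration of the constrained objective, the sketch below computes the discounted return and the discounted safety cost of one recorded episode and checks the cost against the threshold. The function names, the sample values, and the single-episode estimate of the expectation are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t] for one episode."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total


def satisfies_constraint(costs: Sequence[float], gamma_c: float, d: float) -> bool:
    """True if the discounted safety cost of the episode stays within d."""
    return discounted_sum(costs, gamma_c) <= d


# Hypothetical three-step episode: per-step rewards and constraint costs.
rewards = [1.0, 1.0, 1.0]
costs = [0.0, 1.0, 0.0]

episode_return = discounted_sum(rewards, gamma=0.5)  # 1 + 0.5 + 0.25
episode_cost = discounted_sum(costs, gamma=0.5)      # 0.5 * 1
```

In practice the expectation would be estimated over many episodes; a single episode is used here only to keep the arithmetic visible.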
The key challenge lies in simultaneously handling (1) discontinuous dynamics across water–land domains, (2) long-horizon global reachability under heterogeneous environmental costs, and (3) safety-critical control during medium transitions.
3.2. Platform Description
The target platform used in this study is a tracked-thruster amphibious robot designed specifically for operation in both land and underwater environments. As shown in Figure 1, the robot is equipped with a tracked land mobile unit and a thruster-type underwater propulsion unit. This hybrid configuration allows the robot to traverse complex shoreline environments while maintaining maneuverability in shallow and deep water.
The onboard sensing suite consists of a Global Positioning System (GPS), an Inertial Measurement Unit (IMU), a 3D lidar sensor, an ultrasonic sensor, and a water depth sensor. These sensors provide complementary information that supports both navigation and motion control.
3.3. Overall Framework of CD-HSSRL
To address the above challenges, we propose CD-HSSRL (Cross-Domain Hierarchical Safe-Switching Reinforcement Learning), a hierarchical planning–learning architecture for autonomous amphibious robot navigation across water–land environments. The framework decomposes the amphibious navigation problem into three cooperative layers, enabling structured decision-making from long-horizon planning to low-level safe control.
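Formally, the overall navigation policy is factorized as
$$\pi(a_t, o_t \mid s_t, w_t) = \pi_H(o_t \mid s_t, w_t)\, \pi_L(a_t \mid s_t, o_t),$$
where $\pi_H$ denotes a high-level switching policy that selects a domain-specific motion option $o_t$ from the current state $s_t$ and the global waypoint guidance $w_t$, and $\pi_L$ denotes a low-level continuous control policy that generates executable control actions $a_t$ conditioned on the selected option.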
The CD-HSSRL framework consists of three major components.
First, the Cross-Domain Global Reachability Planner constructs a unified cost-aware representation of the water–land environment and generates a global waypoint sequence that guarantees long-horizon reachability while avoiding risky regions such as shallow waters, steep shorelines, and high-friction terrains.
Second, the Hierarchical Safe Switching Policy learns when and where to switch between water, transition, and land motion modes. This high-level policy integrates global waypoint guidance and current state observations to produce stable and consistent medium-switching decisions under discontinuous cross-domain dynamics.
Third, the Safety-Constrained Continuous Controller produces smooth and safe continuous control actions under physical and rule-based constraints. A safety projection layer filters raw actions to satisfy collision avoidance, shoreline stability, and maritime rule compliance, while a risk-sensitive reward formulation further encourages safe navigation behaviors.
By jointly optimizing the high-level switching policy and the low-level controller, the proposed framework achieves coordinated cross-domain decision-making and safety-aware motion control.
The overall architecture of the proposed CD-HSSRL framework is illustrated in Figure 2.
3.4. Cross-Domain Global Reachability Planner
The Cross-Domain Global Reachability Planner (CD-GRP) is designed to generate a long-horizon feasible navigation skeleton that guarantees reachability across heterogeneous water–land environments while avoiding high-risk regions. Unlike conventional global planners that operate on single-terrain maps, CD-GRP constructs a unified cost-aware representation integrating water depth, shoreline slope, land traction, and obstacle distributions.
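Specifically, four domain-dependent cost layers are first constructed over map locations $x$: a water depth cost $C_{\mathrm{depth}}(x)$, a shoreline slope cost $C_{\mathrm{slope}}(x)$, a land traction cost $C_{\mathrm{trac}}(x)$, and an obstacle cost $C_{\mathrm{obs}}(x)$.
These cost layers are fused into a unified cross-domain cost map:
$$C_{\mathrm{uni}}(x) = \lambda_1 C_{\mathrm{depth}}(x) + \lambda_2 C_{\mathrm{slope}}(x) + \lambda_3 C_{\mathrm{trac}}(x) + \lambda_4 C_{\mathrm{obs}}(x),$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are weighting coefficients balancing safety and traversability considerations.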
The weighting coefficients control the relative importance of different navigation objectives, including obstacle avoidance, terrain traversability, and transition stability. In this work, the coefficients are initialized using heuristic prior knowledge and subsequently adjusted through empirical validation experiments. Specifically, higher water depth cost encourages land movement; transition-related weights improve switching smoothness near water–land boundaries; higher terrain-cost weights discourage traversal through unstable or shallow regions; higher obstacle weights encourage safer navigation behavior.
The final parameter values are selected based on performance trade-offs observed during validation experiments.
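The fusion step can be sketched as follows, assuming the unified map is a per-cell weighted sum of the four layers. The layer names, the tiny 1×2 grid, and the weight values are hypothetical and chosen only so the result is easy to check by hand.

```python
from typing import Dict, List

Grid = List[List[float]]


def fuse_cost_layers(layers: Dict[str, Grid], weights: Dict[str, float]) -> Grid:
    """Per-cell weighted sum of equally sized cost layers (assumed fusion rule)."""
    names = list(layers)
    rows = len(layers[names[0]])
    cols = len(layers[names[0]][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for name in names:
        w = weights[name]
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += w * layers[name][r][c]
    return fused


# Hypothetical layers on a 1x2 grid: cell 0 is deep water, cell 1 is a steep,
# obstacle-adjacent shoreline cell.
layers = {
    "depth": [[1.0, 0.0]],
    "slope": [[0.0, 1.0]],
    "traction": [[0.0, 0.0]],
    "obstacle": [[0.0, 1.0]],
}
weights = {"depth": 0.5, "slope": 0.25, "traction": 0.25, "obstacle": 1.0}

unified = fuse_cost_layers(layers, weights)
```

Raising a single weight, for example the obstacle weight, increases the fused cost only in cells where that layer is non-zero, which is how the coefficients trade off safety against traversability.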
Based on the unified cost map $C_{\mathrm{uni}}$, an incremental global path search is performed to obtain an optimal reachability path:
$$\mathcal{P}^{*} = \arg\min_{\mathcal{P}} \sum_{n \in \mathcal{P}} C_{\mathrm{uni}}(n),$$
with heuristic-guided evaluation:
$$f(n) = g(n) + h(n),$$
where $g(n)$ denotes the accumulated cost from the start node to node $n$, and $h(n)$ is the heuristic distance estimate to the goal. The incremental search mechanism enables efficient replanning under dynamic environmental updates.
The final output of CD-GRP is a global waypoint sequence:
$$\mathcal{W} = \{w_1, w_2, \dots, w_K\},$$
which provides high-level guidance for the subsequent hierarchical switching policy.
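The heuristic-guided evaluation described above, an accumulated cost plus a heuristic estimate to the goal, can be sketched with a plain A* search over a grid cost map. This is not the incremental planner itself: replanning is omitted, and the 4-connected grid, the step cost of one plus the cell cost, and the Manhattan heuristic are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar(cost: List[List[float]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """A* on a 4-connected grid; entering a cell costs 1 + cost[cell]."""
    rows, cols = len(cost), len(cost[0])

    def h(n: Cell) -> float:
        # Manhattan distance: admissible because every step costs at least 1.
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

    g: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(h(start), start)]
    closed = set()
    while frontier:
        _, node = heapq.heappop(frontier)
        if node == goal:
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        if node in closed:
            continue
        closed.add(node)
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            step = 1.0 + cost[nxt[0]][nxt[1]]
            if step == float("inf"):  # impassable cell
                continue
            new_g = g[node] + step
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(frontier, (new_g + h(nxt), nxt))
    return None


# Hypothetical cost map: the middle column is a high-risk strip with one gap.
cost_map = [
    [0.0, 9.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
waypoints = astar(cost_map, start=(0, 0), goal=(0, 2))
```

On this map the search detours through the low-cost gap in the bottom row instead of crossing the high-risk strip, which mirrors the intended behavior of avoiding risky regions at the planning level.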
The overall cost-map fusion and incremental reachability planning process of CD-GRP is illustrated in Figure 3.
3.5. Hierarchical Safe Switching Policy
Due to the discontinuous dynamics between water and land motion, directly learning a monolithic policy often leads to unstable behaviors during medium transitions. To address this issue, we propose a Hierarchical Safe Switching Policy (HSSP), which learns to select appropriate domain-specific motion modes while maintaining switching stability.
At each decision step, the high-level switching policy selects an option:
$$o_t \sim \pi_H(\cdot \mid s_t, w_t),$$
where $\pi_H$ is a neural policy network that takes the current state $s_t$ and global waypoint guidance $w_t$ as input.
Once an option is selected, it remains active until a termination condition is satisfied:
To discourage unnecessary frequent medium switching, a switching regularization loss $\mathcal{L}_{\mathrm{sw}}$ is introduced. Its coefficient $\lambda_{\mathrm{sw}}$ controls the stability penalty strength, and the loss softly penalizes abrupt changes in option distributions to ensure stable and smooth medium-switching behaviors.
The switching regularization term is introduced to balance transition responsiveness and policy stability during cross-domain navigation. Without regularization, the agent may frequently oscillate between navigation modes near ambiguous transition boundaries, resulting in unstable trajectories and increased collision risk. Conversely, excessively strong regularization suppresses switching behavior and may delay necessary adaptation when environmental conditions change rapidly. Therefore, the switching penalty introduces an explicit trade-off between adaptability and stability, which is particularly important in heterogeneous transition regions where environmental dynamics change discontinuously.
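To make the trade-off concrete, the sketch below scores a sequence of option distributions with one possible penalty: the mean squared change between consecutive distributions, scaled by a coefficient. This functional form is an assumption chosen for illustration; the regularizer used in the framework may differ.

```python
from typing import Sequence


def switching_penalty(option_probs: Sequence[Sequence[float]], lam: float) -> float:
    """lam * mean squared L2 change between consecutive option distributions."""
    if len(option_probs) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(option_probs, option_probs[1:]):
        total += sum((p - q) ** 2 for p, q in zip(prev, curr))
    return lam * total / (len(option_probs) - 1)


# Options ordered as (water, transition, land); both sequences are hypothetical.
steady = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
oscillating = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

steady_cost = switching_penalty(steady, lam=0.5)
oscillating_cost = switching_penalty(oscillating, lam=0.5)
```

A steady sequence incurs no penalty, while a sequence that flips between water and land at every step is penalized in proportion to the coefficient, so a larger coefficient suppresses oscillation at the price of slower mode changes.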
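The high-level switching policy is optimized using a clipped PPO objective:
$$J_{\mathrm{PPO}}(\theta_H) = \mathbb{E}_t\left[\min\left(r_t(\theta_H)\,\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta_H),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where
$$r_t(\theta_H) = \frac{\pi_H(o_t \mid s_t, w_t;\, \theta_H)}{\pi_H(o_t \mid s_t, w_t;\, \theta_H^{\mathrm{old}})}$$
is the probability ratio between the updated and previous policies, $\epsilon$ is the clipping range, and $\hat{A}_t$ is the advantage estimate.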
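For readers less familiar with the clipped surrogate, the following sketch evaluates it on a small batch using log-probabilities of the selected options. It follows the standard PPO formulation; the clipping range of 0.2 and the sample values are assumptions, not settings reported for this framework.

```python
import math
from typing import Sequence


def ppo_clip_objective(new_logp: Sequence[float], old_logp: Sequence[float],
                       advantages: Sequence[float], eps: float = 0.2) -> float:
    """Mean clipped surrogate objective (to be maximized)."""
    terms = []
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)                 # probability ratio
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip to [1-eps, 1+eps]
        terms.append(min(ratio * adv, clipped * adv))     # pessimistic bound
    return sum(terms) / len(terms)


# Hypothetical batch: the first option became 1.5x more likely with a positive
# advantage, the second became 0.5x as likely with a negative advantage.
objective = ppo_clip_objective(
    new_logp=[math.log(1.5), math.log(0.5)],
    old_logp=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

Both samples are clipped here: the first contributes 1.2 instead of 1.5 and the second contributes -0.8 instead of -0.5, so the policy gains nothing from moving the ratio further outside the trust region.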
The execution loop and optimization flow of the proposed HSSP are illustrated in Figure 4.
3.6. Safety-Constrained Continuous Controller
While the high-level policy determines the motion mode, the low-level controller must generate continuous control actions that are dynamically feasible and safe in real time. We therefore design a Safety-Constrained Continuous Controller (SCCC) that integrates stochastic policy learning with explicit safety constraint enforcement. In particular, the safety constraints explicitly encode collision avoidance and shoreline grounding prevention, which are critical failure modes during water–land medium transitions.
The low-level control policy outputs a raw action:
$$a_t^{\mathrm{raw}} \sim \pi_L(\cdot \mid s_t, o_t),$$
where $\pi_L$ is a stochastic actor network conditioned on the current state and selected option.
Although the low-level policy can generate continuous control commands, the raw action may violate safety requirements in obstacle-dense or shallow-water transition regions. Therefore, before execution, the action is checked against a set of explicitly defined safety constraints and projected into the feasible safe action space when necessary:
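$$a_t^{\mathrm{safe}} = \Pi_{\mathcal{A}_{\mathrm{safe}}(s_t)}\left(a_t^{\mathrm{raw}}\right),$$
where $a_t^{\mathrm{raw}}$ denotes the raw action generated by the low-level policy, $a_t^{\mathrm{safe}}$ is the corrected safe action, and $\Pi$ represents the projection operator onto the state-dependent safe action set $\mathcal{A}_{\mathrm{safe}}(s_t)$:
$$\mathcal{A}_{\mathrm{safe}}(s_t) = \left\{ a \in \mathcal{A} \;:\; g_{\mathrm{obs}}(s_t, a) \le 0,\; g_{\mathrm{gnd}}(s_t, a) \le 0,\; g_{\mathrm{dyn}}(a) \le 0 \right\},$$
where $g_{\mathrm{obs}}$, $g_{\mathrm{gnd}}$, and $g_{\mathrm{dyn}}$ represent obstacle avoidance, grounding prevention, and dynamic feasibility constraints, respectively.
The obstacle avoidance constraint is
$$g_{\mathrm{obs}}(s_t, a) = d_{\min} - \min_{q \in \mathcal{O}} \left\| \hat{p}_{t+1} - q \right\| \le 0,$$
where $\hat{p}_{t+1}$ denotes the predicted next position, $\mathcal{O}$ represents surrounding obstacles, and $d_{\min}$ is the predefined minimum safety distance.
The grounding prevention constraint is
$$g_{\mathrm{gnd}}(s_t, a) = h_{\min} - h(\hat{p}_{t+1}) \le 0,$$
where $h(\cdot)$ denotes the local water depth or terrain clearance, and $h_{\min}$ is the minimum safe operating depth required to avoid grounding.
The dynamic feasibility constraint is
$$g_{\mathrm{dyn}}(a) = \left\| a \right\| - a_{\max} \le 0,$$
where $a_{\max}$ denotes the maximum allowable control magnitude.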
At each control step, the next-state position is estimated based on the current vehicle state and candidate action. Obstacle distances are computed from the cost map, while terrain elevation and local depth information are extracted from the Gazebo + UUV simulation environment.
If all constraints are satisfied, the original action is directly executed. Otherwise, the action is projected into the feasible safe action set by adjusting its magnitude or direction.
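The check-then-project logic can be sketched as below for a planar velocity command. The sketch approximates the projection by clipping the magnitude and then backing off along the commanded direction until all three constraints hold; it does not adjust direction and is not an exact nearest-point projection. The point-obstacle model, the `depth_at` callable, and all numeric values are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Vec = Tuple[float, float]


def is_safe(pos: Vec, action: Vec, obstacles: Sequence[Vec],
            depth_at: Callable[[Vec], float],
            d_min: float, h_min: float, a_max: float, dt: float) -> bool:
    """Check dynamic feasibility, obstacle clearance, and grounding."""
    if math.hypot(action[0], action[1]) > a_max + 1e-9:
        return False
    # Predicted next position under a simple kinematic step (assumption).
    nxt = (pos[0] + action[0] * dt, pos[1] + action[1] * dt)
    if any(math.dist(nxt, ob) < d_min for ob in obstacles):
        return False
    return depth_at(nxt) >= h_min


def project_action(pos: Vec, raw: Vec, obstacles: Sequence[Vec],
                   depth_at: Callable[[Vec], float],
                   d_min: float, h_min: float, a_max: float,
                   dt: float = 1.0, steps: int = 10) -> Vec:
    """Return raw if safe, else the largest scaled-down safe action found."""
    norm = math.hypot(raw[0], raw[1])
    if norm > a_max:  # clip magnitude first
        raw = (raw[0] * a_max / norm, raw[1] * a_max / norm)
    for k in range(steps, -1, -1):  # scale 1.0, 0.9, ..., 0.0
        scale = k / steps
        cand = (raw[0] * scale, raw[1] * scale)
        if is_safe(pos, cand, obstacles, depth_at, d_min, h_min, a_max, dt):
            return cand
    return (0.0, 0.0)  # fall back to stopping


def deep_water(p: Vec) -> float:
    """Hypothetical depth field: uniformly deep."""
    return 5.0


safe_action = project_action((0.0, 0.0), (2.0, 0.0), [(3.0, 0.0)], deep_water,
                             d_min=1.5, h_min=1.0, a_max=2.0)
```

In this example the raw command would bring the robot within 1.0 of the obstacle, so the command is shortened until the predicted clearance reaches the 1.5 minimum. A constrained optimizer could replace the back-off loop when direction changes are also allowed.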
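In addition, we introduce a risk-sensitive reward shaping strategy:
$$\tilde{r}_t = r_t - \lambda_{\mathrm{risk}}\, c_t,$$
where $r_t$ is the task reward, $c_t = C(s_t, a_t)$ is the constraint cost, and $\lambda_{\mathrm{risk}}$ is the risk penalty coefficient. This formulation encourages the controller to prioritize safe behaviors while preserving navigation efficiency.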
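The low-level policy is optimized using the Soft Actor–Critic (SAC) objective:
$$\mathcal{L}_{\mathrm{SAC}}(\theta_L) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_L}\left[\alpha \log \pi_L(a_t \mid s_t, o_t) - Q_{\phi}(s_t, o_t, a_t)\right],$$
where $\mathcal{D}$ is the replay buffer, $\alpha$ is the entropy temperature, and $Q_{\phi}$ is trained using the standard soft Bellman residual in Soft Actor–Critic.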
The overall architecture and optimization loop of SCCC are illustrated in Figure 5.
3.7. Training Objective and Optimization
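The overall CD-HSSRL framework is trained by jointly optimizing the high-level switching policy and the low-level controller. The total loss function is defined as
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{PPO}} + \mathcal{L}_{\mathrm{SAC}} + \mathcal{L}_{\mathrm{sw}},$$
where $\mathcal{L}_{\mathrm{PPO}}$ denotes the PPO loss for the medium-switching policy (the negative of the clipped objective), $\mathcal{L}_{\mathrm{SAC}}$ denotes the SAC loss for continuous control, and $\mathcal{L}_{\mathrm{sw}}$ is the switching regularization term.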
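Parameters $\theta_H$ and $\theta_L$ are updated using stochastic gradient descent:
$$\theta_H \leftarrow \theta_H - \eta\, \nabla_{\theta_H} \mathcal{L}_{\mathrm{total}}, \qquad \theta_L \leftarrow \theta_L - \eta\, \nabla_{\theta_L} \mathcal{L}_{\mathrm{total}},$$
where $\eta$ is the learning rate.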
This joint optimization allows coordinated learning between global switching decisions and local continuous control behaviors.
3.8. Algorithm Pseudocode
Algorithm 1 outlines the training procedure of the proposed CD-HSSRL (Cross-Domain Hierarchical Safe-Switching Reinforcement Learning), integrating global reachability planning, hierarchical medium-switching learning, and safety-constrained continuous control.
Algorithm 1: CD-HSSRL Training Procedure [pseudocode image not available]
4. Experiments
4.1. Datasets and Experimental Settings
To comprehensively evaluate the proposed CD-HSSRL framework for water–land cross-domain autonomous navigation and path planning, we conduct experiments on a suite of publicly available real-world datasets and a physics-based cross-domain amphibious simulation benchmark. This experimental design ensures that water-surface navigation, dynamic obstacle avoidance, land-based planning, and cross-domain transition behaviors are rigorously validated under reproducible conditions while enabling fair comparison with hierarchical planning and safety-constrained control baselines.
To improve the diversity and robustness of evaluation, all experiments are conducted under randomized initialization conditions, including randomized start/goal positions, obstacle layouts, and environmental disturbances. This randomized evaluation strategy helps reduce overfitting to specific environment configurations and provides a broader assessment of cross-domain navigation behavior.
WaterScenes Dataset: For water-surface environment perception and navigation evaluation, we adopt the WaterScenes dataset [31], which is a large-scale multimodal dataset containing synchronized radar and monocular camera data collected in real maritime environments. The dataset provides annotated water-surface scenes with moving vessels, shoreline structures, and free-space segmentation labels, enabling reliable construction of water-surface navigation states. In our experiments, WaterScenes is used to construct perception-driven water-domain navigation scenarios by converting semantic free-space and obstacle annotations into navigable occupancy and risk maps. Thus, the dataset serves as a realistic maritime perception benchmark for generating navigation states rather than providing direct control labels. It supports the evaluation of water-mode planning and collision avoidance performance under real visual sensing conditions. This dataset is primarily used to benchmark water-surface navigation baselines such as APF-DQN, I-DDPG, MORL, RLCA, and APF-D3QNPER.
Maritime Visual Tracking Dataset (MVTD): To assess dynamic obstacle avoidance in complex marine environments, we employ the Maritime Visual Tracking Dataset (MVTD) [32], which contains high-resolution video sequences of vessels under diverse sea states and lighting conditions. MVTD enables the construction of highly dynamic navigation scenarios with moving maritime targets by transforming visual tracking sequences into dynamic obstacle fields for decision-making evaluation. Therefore, MVTD is employed as a perception-driven dynamic navigation benchmark for assessing temporal decision-making and the safety performance of learning-based planners. This dataset is used to validate dynamic avoidance capabilities against baselines, including APF-D3QNPER, RLCA, CLPPO-GIC, and MORL-based methods.
BARN Ground Navigation Benchmark: To evaluate land-domain navigation and provide a standardized ground-planning baseline, we use the Benchmark for Autonomous Robot Navigation (BARN) [33], which consists of procedurally generated navigation environments with varying obstacle densities and complexity levels. BARN is employed to test land-mode planning and continuous control performance of CD-HSSRL and to compare against amphibious and multi-objective baselines such as IPPO, DDQN, HEA-PPO, and IMTCMO. In addition, hierarchical planning baselines such as pH-DRL and planning–learning integration methods such as MP-DQL are evaluated on BARN to benchmark long-horizon hierarchical decision-making and structured planning performance.
Cross-Domain Amphibious Benchmark Environment: Currently, no publicly available dataset contains real-world navigation data involving continuous water–land transition behaviors. To evaluate cross-domain switching and safety-constrained control under realistic physical constraints, we construct a physics-based cross-domain amphibious benchmark environment in Gazebo with water-surface and ground-contact plugins. The simulator models water depth variation, shoreline slope transitions, hydrodynamic drag, terrain friction, and obstacle interactions, thereby forming a reproducible benchmark for water–land transition evaluation. This benchmark environment is used to assess cross-domain reachability planning, medium-switching stability, and safety-constrained control performance of CD-HSSRL. Furthermore, safety-aware baselines such as BarrierNet are evaluated in this environment to compare safety-constrained continuous control and collision-avoidance performance, while pH-DRL and MP-DQL are also tested to benchmark hierarchical switching and planning–learning coupling in cross-domain tasks.
Task Protocol and Data Split: For each dataset and simulation environment, navigation tasks are generated by randomly sampling start and goal positions under domain-specific constraints. Each scenario is evaluated under 100 randomized navigation episodes. For reinforcement learning training, 80% of the generated episodes are used for training, 10% for validation, and 10% for testing. All baselines and the proposed method are trained and evaluated under identical environment settings to ensure fair comparison. All scenario generation scripts, environment configurations, and evaluation protocols will be released to ensure reproducibility.
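The 80/10/10 protocol can be sketched as a seeded shuffle followed by slicing. The function name, the seed, and the use of integer episode identifiers are assumptions for illustration; the released scripts may organize the split differently.

```python
import random
from typing import List, Sequence, Tuple


def split_episodes(episode_ids: Sequence[int], seed: int = 0,
                   train: float = 0.8, val: float = 0.1
                   ) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle episode ids reproducibly and split into train/val/test."""
    ids = list(episode_ids)
    random.Random(seed).shuffle(ids)  # local RNG keeps the split reproducible
    n_train = int(round(train * len(ids)))
    n_val = int(round(val * len(ids)))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# 100 randomized episodes for one scenario, as in the protocol above.
train_ids, val_ids, test_ids = split_episodes(range(100))
```

Fixing the seed lets every baseline see the same partition, which supports the identical-settings comparison described in the protocol.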
To construct realistic and controllable navigation scenarios, the original dataset is transformed into a simulation-compatible environment within the Gazebo + UUV framework. The conversion process consists of the following stages. (1) Data Preprocessing: Raw dataset inputs, including spatial layouts and environmental features, are first normalized and discretized into a structured representation. Irrelevant or noisy elements are filtered to ensure consistency with the simulation requirements. (2) Environment Mapping: The processed data are mapped into a simulation environment by generating corresponding terrain structures, obstacle distributions, and domain labels (e.g., water, land, and transition regions). In particular, depth information and terrain elevation are used to define heterogeneous regions and cross-domain boundaries. (3) Physical Parameter Assignment: To ensure realistic dynamics, physical parameters such as friction coefficients, drag forces, and buoyancy effects are assigned based on the mapped environment. These parameters are integrated into the UUV Simulator to reflect domain-specific behaviors. (4) Scenario Generation: Multiple navigation scenarios are generated by varying start and goal positions, obstacle densities, and environmental conditions. This allows systematic evaluation under diverse and controlled settings.
Despite enabling flexible environment construction, the dataset-to-environment conversion process may introduce several sources of bias. First, discretization and simplification of raw data may lead to loss of fine-grained environmental details, potentially affecting the fidelity of terrain representation. Second, the mapping from dataset features to simulation parameters relies on predefined assumptions, which may not fully capture real-world variability. Third, the generated scenarios may exhibit distributional differences compared to real-world environments, particularly in terms of dynamic interactions and sensor noise characteristics. Finally, the use of domain labels introduces a level of abstraction that may oversimplify complex cross-domain transitions. To mitigate these issues, multiple scenarios with varying configurations are evaluated, and robustness experiments under disturbances are conducted to assess generalization performance.
Through the above experimental setup, the proposed CD-HSSRL framework is systematically evaluated on water-domain navigation, land-domain planning, dynamic obstacle avoidance, hierarchical decision-making, and cross-domain transition tasks, providing comprehensive validation of its effectiveness, safety, and generalization ability.
Figure 6 illustrates the overall experimental process.
4.2. Implementation Details
All experiments are implemented in Python 3.13.5 using the PyTorch deep learning framework. The reinforcement learning components are built upon the OpenAI Gym interface and Stable-Baselines3 library, while the amphibious simulation environment is developed in Gazebo with UUV-Simulator plugins. All experiments are conducted on a workstation equipped with an NVIDIA RTX 4090 GPU and an Intel Xeon CPU.
Network Architecture: For the high-level switching policy, we adopt a multilayer perceptron with two hidden layers of 256 units, followed by a softmax output layer for option selection. For the low-level continuous control policy, we use an actor–critic architecture with two fully connected hidden layers of 256 units. ReLU activation is applied in all hidden layers. The Q-networks in SAC and value networks in PPO share the same backbone structure for fair comparison across learning-based baselines. For hierarchical baselines such as pH-DRL, the high-level and low-level networks follow the original two-layer hierarchical architecture described in their implementation. For MP-DQL, motion primitive libraries are constructed according to the original setting, and DQN networks are implemented with the same backbone size as our planner network. For the safety-control baseline BarrierNet, the differentiable barrier layer is integrated on top of a continuous control policy network with identical hidden dimensions.
State and Action Representation: The state input consists of local observation features, global waypoint guidance, and domain indicators (water, transition, and land). For WaterScenes and MVTD, visual observations are encoded using a lightweight convolutional encoder to extract semantic features. For BARN- and Gazebo-based environments, LiDAR-like occupancy grids and robot kinematic states are used as inputs. All learning-based baselines, including pH-DRL, MP-DQL, and BarrierNet, are adapted to use the same unified observation space and action definitions to ensure fair comparison. The action space includes continuous linear velocity and angular velocity commands.
Training Hyperparameters: The discount factor is set to . For the PPO-based high-level switching policy, the clipping parameter is , and the learning rate is . For the SAC-based low-level controller, the entropy coefficient is automatically tuned, and the learning rate is . The switching regularization coefficient is set to , and the safety risk penalty coefficient is . The replay buffer size is , and mini-batches of size 256 are sampled for each update. For pH-DRL and MP-DQL baselines, the original hyperparameters reported in their papers are adopted and then slightly tuned to match the unified environment scale. For BarrierNet, the barrier function penalty coefficient follows the default setting in the original implementation.
Training Protocol: All methods are trained for 2 million environment interaction steps. For each baseline and the proposed method, multiple randomized navigation scenarios are evaluated under different initialization conditions. Model checkpoints with the best validation performance are selected for final testing. To ensure fair comparison, all baselines are trained using the same observation space, action space, reward definitions, and environment settings.
Simulation Settings: In the Gazebo amphibious simulation, water drag coefficients, shoreline slope limits, and terrain friction parameters are calibrated according to standard USV and ground robot dynamic models. Collision detection and grounding events are monitored to compute safety-related evaluation metrics. The simulation runs at 20 Hz control frequency for all tested methods, including safety-constrained baselines such as BarrierNet.
Reproducibility: All datasets used in this study are publicly available. The simulation environment configuration files, training scripts, and evaluation protocols are publicly available at https://github.com/ls142968/CD-HSSRL.git (accessed on 28 April 2026). These implementation settings ensure stable training, fair baseline comparison, and reproducible evaluation for cross-domain amphibious navigation and path planning.
4.3. Baselines
To comprehensively evaluate the effectiveness of the proposed cross-domain navigation framework for autonomous tracked amphibious robotic systems, we compare our method with a set of representative and recent baselines covering amphibious cross-domain path planning, learning-based water-domain navigation, collision avoidance under rule-constrained navigation, multi-objective decision-making, and safety-aware hierarchical control. All selected baselines are derived from published studies with explicitly named methodologies and established experimental protocols. This comparison set ensures a fair and comprehensive validation of global planning capabilities, cross-medium adaptability, dynamic obstacle avoidance, hierarchical decision-making, and safety-constrained control. All baseline methods are implemented within the same Gazebo + UUV simulation framework and adapted to a unified observation and action space. Only interface-level modifications are introduced to ensure compatibility with the proposed simulation environment, while the original algorithmic structures of all baselines are preserved.
Unified Observation Space: All methods receive identical state observations, including (1) vehicle position and orientation, (2) linear and angular velocities, (3) obstacle distance information extracted from the local cost map, (4) terrain-related features such as local depth and transition-region indicators, and (5) local environmental disturbance information. The observation dimensions are kept consistent across all methods to avoid performance differences caused by unequal environmental information.
Unified Action Space: All methods output continuous control commands consisting of forward velocity, steering/angular control, and thrust-related actuation signals. Action ranges are normalized to the same control bounds across all methods to ensure equivalent actuation capability.
Unified Reward Structure: To minimize bias introduced by reward engineering, a shared reward structure is adopted whenever possible. The reward function includes the following: goal-reaching reward, collision penalty, transition smoothness regularization, and energy-consumption penalty. Only minimal modifications required for algorithmic compatibility are introduced.
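The four reward terms listed above can be combined as in the following sketch. The weights in `RewardWeights`, the use of a mode-change flag as the transition smoothness term, and the quadratic energy proxy are all assumptions for illustration; the actual coefficients and functional forms are not stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # All weights are hypothetical placeholders, not the values used in the paper.
    goal: float = 10.0
    collision: float = 10.0
    switch: float = 0.5
    energy: float = 0.01


def step_reward(reached_goal, collided, mode_changed, v, omega,
                w=RewardWeights()):
    """Shared per-step reward: goal bonus minus collision, switching, and energy costs."""
    reward = 0.0
    if reached_goal:
        reward += w.goal
    if collided:
        reward -= w.collision
    if mode_changed:                               # assumed smoothness regularizer
        reward -= w.switch
    reward -= w.energy * (v * v + omega * omega)   # assumed actuation-energy proxy
    return reward


print(step_reward(True, False, False, 0.0, 0.0))
print(step_reward(False, True, True, 1.0, 0.0))
```

Keeping the weights in one immutable object makes it easy to apply exactly the same reward to every baseline.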
Unified Training and Evaluation Settings: All methods are trained and evaluated under identical simulation conditions, including the same environment layouts, obstacle configurations, cross-domain transition regions, disturbance settings, training episodes, and random-seed initialization strategy. No additional privileged information is provided to the proposed method.
Cross-Domain Amphibious Path Planning Baselines: IPPO [
34] proposes an Improved Proximal Policy Optimization framework for global path planning of amphibious robots. It enhances PPO by integrating attention and recurrent modules to address discontinuous dynamics during medium switching, making it a representative baseline for cross-domain reinforcement learning-based navigation. For fair comparisons, IPPO is adapted to the unified continuous-control observation and action space while preserving its original policy architecture.
DDQN [
35] introduces a global path planning algorithm based on Double Deep Q-Networks for multi-task amphibious robotic platforms. This work represents one of the early reinforcement learning solutions for amphibious navigation, serving as a fundamental value-based baseline for cross-medium global planning. Since DDQN originally employs discrete action selection, its action interface is discretized from the unified continuous control space while maintaining identical environmental observations.
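One simple way to discretize a continuous velocity command space for a value-based baseline is a uniform grid, sketched below. The grid resolution (`n_v`, `n_w`), the normalized bounds, and the function name are assumptions; the discretization actually used for DDQN is not specified in the text.

```python
from itertools import product


def build_discrete_actions(v_max=1.0, w_max=1.0, n_v=3, n_w=5):
    """Uniform grid of (linear velocity, angular velocity) pairs over normalized bounds."""
    vs = [v_max * i / (n_v - 1) for i in range(n_v)]                # 0 .. v_max
    ws = [-w_max + 2 * w_max * j / (n_w - 1) for j in range(n_w)]   # -w_max .. w_max
    return list(product(vs, ws))


actions = build_discrete_actions()


def to_continuous(index):
    """Map a discrete action index chosen by the Q-network to a continuous command."""
    return actions[index]


print(len(actions), to_continuous(0), to_continuous(len(actions) - 1))
```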
HEA-PPO [
36] combines a hyper-heuristic evolutionary algorithm with PPO to achieve energy-constrained collaborative path planning for heterogeneous amphibious robotic systems. It provides a hybrid evolutionary–learning strategy to handle multi-robot coordination and complex environmental constraints. The method is adapted using the same state observations and reward structure adopted in our framework.
IMTCMO [
37] proposes an improved multitasking-constrained multi-objective optimization framework for multi-amphibious robotic collaboration in constrained environments. Unlike end-to-end learning approaches, IMTCMO focuses on constrained multi-objective optimization, providing a strong non-learning baseline for cross-domain path planning under multiple conflicting objectives. The planner receives the same terrain and obstacle information as all learning-based methods.
Learning-Based Water-Domain Navigation Baselines: APF-DQN [
38] presents a hybrid artificial potential field–DQN framework enhanced with ocean current prediction for water-surface robotic navigation in dynamic environments. By integrating physical prior guidance with deep Q-learning, it serves as a representative baseline for physics-guided learning in water-domain navigation. The action space is discretized consistently with the DDQN baseline.
I-DDPG [
39] proposes an improved deep deterministic policy gradient algorithm for continuous-action water-domain navigation, targeting control smoothness and reward shaping for dynamic environments. This method acts as a typical actor–critic continuous-action baseline for comparing control stability and convergence behavior. The original continuous-action architecture is preserved without structural modification.
MORL-based [
40] designs a multi-objective reinforcement learning architecture for water-domain robotic navigation, employing ensemble decision mechanisms to balance safety, efficiency, and energy consumption. It provides a canonical baseline for multi-objective decision-making in learning-based navigation. The reward weights are normalized to align with the unified evaluation objectives.
Safety-Aware Collision Avoidance and Dynamic Decision Baselines: RLCA [
41] introduces a reinforcement learning collision avoidance algorithm by explicitly incorporating maneuvering characteristics and rule-constrained navigation principles into the learning framework. This method forms a representative safety-aware baseline for rule-constrained collision avoidance in autonomous robotic navigation. The same obstacle and transition-region information are provided as environmental inputs.
APF-D3QNPER [
42] proposes a hybrid deep learning architecture combining artificial potential fields, dueling double DQN, prioritized experience replay, and LSTM for navigation in unknown dynamic environments. It provides a strong baseline for dynamic obstacle avoidance with temporal memory and guided exploration. Its observation interface is standardized to the same local environmental representation used in our framework.
CLPPO-GIC [
43] develops a CNN–LSTM–PPO framework with a generalized integral compensator mechanism for multi-agent autonomous collision avoidance. By integrating temporal feature extraction and state-error compensation into PPO, it serves as a representative baseline for sequential decision-making and dynamic interaction scenarios. The network structure remains unchanged while adopting the same simulation settings and action constraints.
Hierarchical Planning and Safety-Constrained Control Baselines: BarrierNet [
28] proposes differentiable control barrier functions for learning safe robot control. By embedding a safety-filtering layer into policy optimization, it serves as a representative baseline for safety-constrained continuous control and directly corresponds to the safety projection mechanism in our controller. The same safety constraints and control bounds are applied during evaluation.
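The idea of a barrier-based safety filter can be illustrated with a one-dimensional toy case. The sketch below is not BarrierNet's differentiable quadratic-program layer and is not the controller of this paper. It assumes a robot heading straight toward an obstacle at speed v, a barrier h = d - d_safe, and the condition dh/dt + alpha * h >= 0, which reduces to v <= alpha * h. The names `d_safe`, `alpha`, and `v_max` are illustrative.

```python
def safe_approach_speed(v_nominal, distance, d_safe=0.5, alpha=1.0, v_max=1.0):
    """Closed-form minimizer of (v - v_nominal)^2 subject to v <= alpha * h and 0 <= v <= v_max.

    h = distance - d_safe is the barrier value. When h is negative the
    constraint set is empty for v >= 0, so the filter falls back to stopping.
    """
    h = distance - d_safe
    upper = min(v_max, max(0.0, alpha * h))   # tightest admissible speed
    return min(max(v_nominal, 0.0), upper)    # projection onto [0, upper]


print(safe_approach_speed(1.0, 5.0))   # far from the obstacle: command unchanged
print(safe_approach_speed(1.0, 0.4))   # inside the safety margin: stop
```

Because the constraint is a single upper bound, the projection has a closed form. With several constraints or a multi-dimensional action, a quadratic-program solver would be needed instead.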
pH-DRL [
26] introduces a predictive hierarchical reinforcement learning framework for long-horizon navigation, where a high-level planner guides low-level controllers through predictive sub-goal generation. This method serves as a representative hierarchical decision-making baseline comparable to our Hierarchical Safe Switching Policy. The hierarchical interfaces are preserved while adapting the observation inputs to the unified environment representation.
MP-DQL [
27] formulates motion primitives as the action space of deep Q-learning for autonomous driving planning. By integrating structured global planning with deep learning-based decision-making, it provides a strong baseline for comparing cross-domain global reachability planning and planning–learning joint optimization. Motion primitives are re-parameterized according to the amphibious vehicle dynamics while maintaining the original planning logic.
Overall, these baselines collectively cover cross-domain amphibious navigation, learning-based water-domain navigation, rule-constrained collision avoidance, multi-objective optimization, hierarchical decision-making, and safety-constrained control. By standardizing observation space, action space, reward structure, and environmental conditions, the proposed CD-HSSRL framework is evaluated under a consistent and reproducible experimental protocol, ensuring that performance differences primarily arise from algorithmic characteristics rather than inconsistent implementation settings.
4.4. Evaluation Metrics
To comprehensively evaluate the effectiveness of the proposed CD-HSSRL framework in cross-domain autonomous navigation and path planning, we adopt a set of quantitative metrics covering navigation success, safety performance, efficiency, and switching stability. All metrics are computed consistently for the proposed method and all baselines under identical experimental settings.
Success Rate (SR): The success rate measures the proportion of navigation trials in which the robot successfully reaches the target without collision or grounding:
Collision Rate (CR): The collision rate evaluates safety performance by measuring the frequency of collision or grounding events:
Safety Violation Rate (SVR): To further assess safety-constrained control performance, we measure the frequency of safety constraint violations:
where the violation count refers to episodes in which safety constraints (collision, grounding, or forbidden-zone entry) are violated. This metric is used in particular to compare safety-aware baselines such as BarrierNet.
Average Path Length (APL): APL measures navigation efficiency by computing the average traveled path length:
Average Navigation Time (ANT): ANT evaluates decision-making and planning efficiency by measuring the average time steps required to reach the target:
where the per-episode term denotes the number of time steps needed to complete episode i. This metric is mainly used to compare hierarchical planning and planning–learning baselines such as pH-DRL and MP-DQL.
Energy Consumption (EC): Energy consumption evaluates control efficiency by accumulating actuation energy along trajectories:
Switching Stability Index (SSI): To quantify medium-switching stability across water–land transitions, we define a Switching Stability Index:
Cross-Domain Transition Success Rate (CTS): CTS evaluates the success probability of completing water–land or land–water transitions without failure:
These metrics jointly evaluate global reachability, local safety, control efficiency, hierarchical decision-making performance, and cross-domain switching capability, providing a comprehensive assessment of the proposed CD-HSSRL framework against all baselines.
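The ratio- and average-based metrics above can be computed from per-episode logs as in the sketch below. The `Episode` record and its field names are hypothetical, and averaging APL and ANT over all episodes (rather than only successful ones) is an assumption. EC and SSI are left out because their exact formulas are not reproduced in this section.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    success: bool                   # reached the target without collision or grounding
    collided: bool                  # a collision or grounding event occurred
    violated: bool                  # any safety constraint was violated
    path_length: float              # traveled distance
    steps: int                      # time steps until the episode ended
    transitions_attempted: int = 0
    transitions_completed: int = 0


def summarize(episodes):
    """Aggregate episode logs into SR, CR, SVR, APL, ANT, and CTS."""
    n = len(episodes)
    attempted = sum(e.transitions_attempted for e in episodes)
    completed = sum(e.transitions_completed for e in episodes)
    return {
        "SR": sum(e.success for e in episodes) / n,
        "CR": sum(e.collided for e in episodes) / n,
        "SVR": sum(e.violated for e in episodes) / n,
        "APL": mean(e.path_length for e in episodes),
        "ANT": mean(e.steps for e in episodes),
        "CTS": completed / attempted if attempted else float("nan"),
    }


# Hypothetical logs for four episodes.
episodes = [
    Episode(True, False, False, 10.0, 100, 1, 1),
    Episode(True, False, False, 12.0, 120, 1, 1),
    Episode(True, False, True, 14.0, 140, 2, 1),
    Episode(False, True, True, 8.0, 80, 0, 0),
]
stats = summarize(episodes)
print(stats)
```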
5. Results and Discussion
5.1. Overall Comparison with Representative Baselines
We first conduct a comprehensive comparison between the proposed CD-HSSRL framework and representative baselines on water-domain navigation, land-domain navigation, and cross-domain transition tasks. The evaluated baselines include IPPO, DDQN, HEA-PPO, IMTCMO, APF-DQN, I-DDPG, MORL-based, RLCA, APF-D3QNPER, and CLPPO-GIC, together with three additional baselines: BarrierNet, pH-DRL, and MP-DQL. All methods are trained and tested under identical observation spaces, action spaces, reward functions, and environment settings to ensure fair comparison. For baselines that are originally defined with structured action spaces (e.g., MP-DQL) or safety-filtering layers (e.g., BarrierNet), we follow their original protocol while aligning the state representation and evaluation interface to our unified cross-domain navigation setting.
Overall Quantitative Results:
Table 1 reports the overall performance on the WaterScenes, MVTD, BARN, and Gazebo cross-domain environments. For WaterScenes, MVTD, and BARN, we report the success rate (SR), collision rate (CR), average path length (APL), and energy consumption (EC). For the Gazebo cross-domain environment, we report the SR, CR, Switching Stability Index (SSI), and Cross-Domain Transition Success Rate (CTS), which directly measure medium-switching stability and transition robustness. The best results are highlighted in bold.
Visualization of SOTA Comparison: To provide an intuitive comparison,
Figure 7 visualizes the SR and CR performance across different datasets. CD-HSSRL consistently achieves higher success rates and lower collision rates compared with all baselines, particularly in the Gazebo cross-domain environment, demonstrating its effective cross-medium decision-making and safety control capability.
Result Analysis: From
Table 1 and
Figure 7, several observations can be made.
First, on WaterScenes and MVTD, CD-HSSRL achieves favorable performance compared with USV-oriented baselines such as APF-DQN, I-DDPG, and RLCA, indicating that the proposed Safety-Constrained Continuous Controller effectively improves dynamic obstacle avoidance under complex maritime conditions. Moreover, compared with BarrierNet, CD-HSSRL achieves higher SR with comparable or lower CR, suggesting that jointly optimizing hierarchical switching with safety-aware control yields additional benefits beyond purely safety-filtered control.
Second, on the BARN benchmark, CD-HSSRL achieves comparable or better performance than land-navigation and hierarchical planning baselines such as IPPO, HEA-PPO, and pH-DRL, demonstrating that the low-level controller maintains stable control performance and the high-level policy supports effective long-horizon decision-making even without water-domain dynamics.
Third, in the Gazebo cross-domain environment, CD-HSSRL shows a consistently higher Cross-Domain Transition Success Rate (CTS) and Switching Stability Index (SSI) than amphibious baselines such as IPPO, DDQN, HEA-PPO, and IMTCMO, as well as the hierarchical and planning baselines pH-DRL and MP-DQL. This indicates that the Hierarchical Safe Switching Policy and the unified cross-domain reachability planner effectively handle discontinuous water–land dynamics. In addition, CD-HSSRL achieves the lowest CR among all compared methods, indicating that the safety-constrained controller is essential for preventing grounding and collisions during shoreline interaction.
Overall, these results confirm that CD-HSSRL achieves competitive performance across water-domain navigation, land-domain planning, and cross-domain transition tasks, supporting its effectiveness for autonomous amphibious robot navigation and path planning.
5.2. Cross-Domain Transition Performance
Since the primary contribution of CD-HSSRL lies in handling discontinuous water–land dynamics, we further conduct dedicated experiments to evaluate cross-domain transition performance in the Gazebo-based amphibious simulation environment. Three representative transition tasks are designed: (1) water to land (shoreline climbing), (2) land to water (water entry), and (3) multiple transitions (water–land–water). These tasks explicitly test global reachability planning, medium-switching stability, and safety-constrained control under realistic cross-domain physical interactions.
Baselines for Cross-Domain Evaluation: To ensure a fair and mechanism-consistent comparison, we select four representative baselines for cross-domain transition evaluation: IPPO as a reinforcement learning-based amphibious navigation method, HEA-PPO as an optimization-driven energy-constrained amphibious planner, RLCA as a rule-based safety-aware collision avoidance strategy, and BarrierNet as a differentiable safety-constrained control framework. These baselines respectively correspond to cross-domain policy learning, multi-objective optimization, rule-constrained safety control, and optimization-based safety filtering, thus providing comprehensive comparative perspectives for evaluating hierarchical switching and safety-constrained control in CD-HSSRL.
Quantitative Results:
Table 2 summarizes cross-domain transition performance in terms of the Cross-Domain Transition Success Rate (CTS), Switching Stability Index (SSI), collision rate (CR), Safety Violation Rate (SVR), and energy consumption (EC).
Trajectory Visualization: To qualitatively illustrate cross-domain navigation behaviors, Figure 8 shows representative trajectories of CD-HSSRL, IPPO, and HEA-PPO in the water-to-land task. IPPO and HEA-PPO often experience unstable mode switching or partial grounding near the shoreline because they lack explicit switching stability constraints. BarrierNet achieves safe but conservative shoreline behavior with slower progress. In contrast, CD-HSSRL generates smooth transition trajectories and reaches land targets without oscillatory control. Although some of its trajectories are longer, they are smoother and safer, with reduced collision risk near domain boundaries.
Figure 9 shows the position of the robot at different times from the starting point to the ending point.
Switching Sequence Analysis: To further examine switching stability,
Figure 10 visualizes the temporal evolution of motion modes during cross-domain navigation. For clarity of temporal illustration, IPPO is selected as the representative reinforcement learning baseline, and BarrierNet is selected as the representative safety-filtering baseline. CD-HSSRL exhibits consistent and minimal mode switches, whereas IPPO shows frequent oscillations between water and transition modes, and BarrierNet tends to delay switching decisions due to conservative safety constraints, leading to reduced transition efficiency.
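The qualitative difference between consistent and oscillatory mode sequences can be quantified with simple counts, as sketched below. These helper functions are illustrative diagnostics only and are not the Switching Stability Index defined in Section 4.4; the mode names and example sequences are hypothetical.

```python
def count_switches(modes):
    """Number of time steps at which the motion mode changes."""
    return sum(1 for a, b in zip(modes, modes[1:]) if a != b)


def count_reversals(modes):
    """Number of A -> B -> A patterns after collapsing consecutive repeats."""
    collapsed = [m for i, m in enumerate(modes) if i == 0 or m != modes[i - 1]]
    return sum(1 for i in range(2, len(collapsed)) if collapsed[i] == collapsed[i - 2])


# Hypothetical mode sequences for a water-to-land task.
stable = ["water"] * 4 + ["transition"] * 3 + ["land"] * 4
oscillating = ["water", "transition", "water", "transition",
               "water", "transition", "land"]

print(count_switches(stable), count_reversals(stable))            # 2 0
print(count_switches(oscillating), count_reversals(oscillating))  # 6 4
```

A sequence that passes through water, transition, and land exactly once has two switches and no reversals, while back-and-forth switching near the shoreline shows up as reversals.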
Result Analysis: From
Table 2, CD-HSSRL achieves the highest CTS and SSI among all compared methods, indicating effective cross-domain transition robustness and stable medium-switching decisions. In particular, CD-HSSRL improves CTS by 7–15% over representative baselines IPPO, HEA-PPO, RLCA, and BarrierNet, demonstrating the effectiveness of the Hierarchical Safe Switching Policy. Moreover, the lowest CR and SVR confirm that the Safety-Constrained Continuous Controller successfully prevents grounding and collision events during shoreline interaction. Although BarrierNet maintains strong safety performance through explicit constraint enforcement, it exhibits higher energy consumption and slower transitions due to conservative action filtering.
Overall, these results verify that the proposed CD-HSSRL framework effectively addresses discontinuous cross-domain dynamics and achieves competitive performance in amphibious water–land transition tasks.
5.3. Ablation Studies
To investigate the contribution of each key component in CD-HSSRL, we conduct ablation experiments by selectively removing major modules from the proposed framework. All ablation variants are evaluated under the same Gazebo cross-domain transition tasks and MVTD dynamic obstacle scenarios, since these environments best reflect the core challenges of cross-domain switching and safety-aware control.
Ablation Settings: We design five representative ablation variants:
A1: w/o CD-GRP: removing the Cross-Domain Global Reachability Planner and replacing it with a local greedy planner.
A2: w/o HSSP: removing the Hierarchical Safe Switching Policy and using a single flat policy.
A3: w/o Safety Projection: removing the safety-constrained action projection layer.
A4: w/o Risk-Sensitive Reward: removing the risk penalty term in reward shaping.
A5: w/o Switching Regularization: removing the switching stability loss.
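The five variants can be expressed as single-flag changes to one configuration object, as in the sketch below. The `Modules` class and its field names are hypothetical and do not correspond to identifiers in the released code.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Modules:
    # Hypothetical switches for the five components that are ablated.
    global_planner: bool = True             # CD-GRP
    hierarchical_switching: bool = True     # HSSP
    safety_projection: bool = True
    risk_reward: bool = True
    switching_regularization: bool = True


FULL = Modules()
ABLATIONS = {
    "A1": replace(FULL, global_planner=False),
    "A2": replace(FULL, hierarchical_switching=False),
    "A3": replace(FULL, safety_projection=False),
    "A4": replace(FULL, risk_reward=False),
    "A5": replace(FULL, switching_regularization=False),
}


def disabled_modules(config):
    """Names of the modules switched off in a configuration."""
    return [name for name, enabled in asdict(config).items() if not enabled]


for tag, config in ABLATIONS.items():
    print(tag, disabled_modules(config))
```

Deriving each variant from the full configuration guarantees that exactly one module differs per ablation.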
Quantitative Results:
Table 3 reports the ablation results in terms of the Cross-Domain Transition Success Rate (CTS), Switching Stability Index (SSI), collision rate (CR), and energy consumption (EC).
Visualization of Ablation Impact:
Figure 11 visualizes the impact of removing each module on CTS and CR. Removing HSSP and the switching regularization term causes significant degradation in SSI and CTS, while removing the safety projection layer leads to a sharp increase in collision rate. These observations highlight the necessity of hierarchical switching and explicit safety enforcement in cross-domain navigation.
Result Analysis: From
Table 3, removing the Cross-Domain Global Reachability Planner (A1) reduces CTS by 8%, indicating that unified cross-domain cost-aware planning is essential for successful shoreline transitions. Removing the Hierarchical Safe Switching Policy (A2) results in unstable mode decisions and a significant drop in SSI, demonstrating the importance of structured option-based switching for discontinuous water–land dynamics. The absence of the Safety Projection layer (A3) causes CR to increase drastically, confirming that explicit constraint enforcement is critical for preventing grounding and collisions. Finally, removing the risk-sensitive reward or switching regularization (A4 and A5) leads to moderate but consistent performance degradation, showing that both safety-oriented reward shaping and switching stability loss contribute to robust and efficient navigation.
Overall, the ablation results verify that each proposed module plays a complementary and indispensable role in achieving robust cross-domain autonomous navigation.
5.4. Robustness Analysis
In real-world amphibious navigation, environmental disturbances, perception uncertainty, and scene complexity may significantly affect policy stability and safety. To evaluate the robustness of CD-HSSRL under such uncertainties, we conduct robustness experiments from three perspectives: (1) hydrodynamic disturbance intensity, (2) perception noise, and (3) obstacle density variation. All experiments are performed in the Gazebo cross-domain simulation and MVTD dynamic navigation environments.
For fair and representative comparison, IPPO and HEA-PPO are selected as representative amphibious navigation baselines, RLCA represents rule-based maritime safety control, BarrierNet represents optimization-based safety-constrained control, and pH-DRL represents hierarchical long-horizon decision-making. These baselines respectively cover reinforcement learning-based cross-domain navigation, optimization-driven planning, rule-constrained safety control, safety-filtering control, and hierarchical planning, thus providing comprehensive perspectives for evaluating the robustness of CD-HSSRL.
R1: Hydrodynamic Disturbance: We vary the water current velocity in the Gazebo environment from 0 to 1.5 m/s to simulate calm to strong flow conditions. Table 4 reports the success rate (SR) and collision rate (CR) under different current intensities.
R2: Perception Noise: To simulate sensor uncertainty, Gaussian noise with increasing variance is added to observation features extracted from WaterScenes and MVTD. Table 5 presents the Cross-Domain Transition Success Rate (CTS) under different noise levels.
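The noise injection in R2 can be sketched as zero-mean Gaussian perturbation of the feature vector. The noise levels and feature values below are hypothetical, and the helper takes the standard deviation `sigma` (the square root of the variance mentioned above).

```python
import random


def add_gaussian_noise(features, sigma, rng):
    """Return a copy of the features with zero-mean Gaussian noise of std sigma added."""
    return [x + rng.gauss(0.0, sigma) for x in features]


rng = random.Random(0)                 # fixed seed for repeatable perturbations
clean = [0.5, -1.0, 2.0]               # hypothetical observation features
for sigma in (0.0, 0.05, 0.1, 0.2):    # hypothetical noise levels
    noisy = add_gaussian_noise(clean, sigma, rng)
    print(sigma, noisy)
```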
R3: Obstacle Density: We further increase the number of dynamic obstacles in MVTD and Gazebo environments to evaluate navigation robustness under crowded scenes. For long-horizon planning robustness comparison, pH-DRL is included as a representative hierarchical decision-making baseline.
Figure 12 illustrates SR degradation trends as obstacle density increases.
Result Analysis: From Table 4 and Table 5, CD-HSSRL consistently maintains higher SR and CTS and lower CR than all compared baselines under different disturbance levels. Notably, BarrierNet achieves relatively low collision rates due to conservative safety filtering, but its success rate degrades faster under strong currents and high perception noise, indicating limited adaptability to dynamic cross-domain disturbances. Meanwhile, pH-DRL shows more stable long-horizon planning under increased obstacle density, but it still suffers from switching oscillations during water–land transitions.
We further analyze representative failure cases observed during the experiments. Failures typically occur under (1) strong dynamic disturbances near transition regions, (2) ambiguous domain boundaries, and (3) high levels of sensor noise. These cases reveal that the switching mechanism may become unstable under rapidly changing conditions, leading to suboptimal decisions.
Overall, CD-HSSRL demonstrates effective robustness against hydrodynamic disturbances, perception uncertainty, and scene complexity, confirming that hierarchical safe switching and safety-constrained continuous control jointly contribute to stable and robust cross-domain navigation.
5.5. Parameter Sensitivity Analysis
The proposed CD-HSSRL framework introduces several key hyperparameters that control cross-domain switching stability, safety-constrained optimization, and terrain-aware navigation behavior. To verify that the performance improvements are not overly dependent on specific parameter settings, we conduct sensitivity analysis on four representative parameters: (1) the switching regularization coefficient, (2) the safety projection penalty coefficient, (3) the hierarchical option termination threshold, and (4) the cost-map weighting coefficient.
All experiments are conducted in the Gazebo + UUV cross-domain simulation environment using both water-to-land and multi-transition navigation tasks.
P1: Switching Regularization Coefficient: This coefficient controls the strength of the switching stability loss introduced in the Hierarchical Safe Switching Policy. We vary it from 0 to 1.0 and report CTS and SSI in Table 6.
P2: Safety Projection Penalty: This parameter weights the constraint violation penalty in the Safety-Constrained Continuous Controller. We vary it from 0.1 to 2.0 and report the collision rate (CR) and energy consumption (EC) in Table 7.
P3: Option Termination Threshold: This threshold determines when the high-level policy terminates a motion option and triggers medium switching. We vary it from 0.3 to 0.9 and evaluate the Cross-Domain Transition Success Rate (CTS). Figure 13 visualizes the CTS variation trend.
P4: Cost-Map Weighting Coefficient. This weighting coefficient controls the influence of terrain-aware traversal costs in the global cost-map representation. Larger values encourage the agent to avoid risky transition regions and obstacle-dense areas, while smaller values prioritize shorter trajectories with weaker terrain awareness. We vary it from 0.1 to 2.0 and evaluate the success rate (SR), collision rate (CR), and Average Path Length (PL). The results are summarized in Table 8.
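The trade-off described for this coefficient can be reproduced on a toy grid. The sketch below runs Dijkstra's algorithm on a 4-connected grid where entering a cell costs `1 + weight * risk`; the grid, the cost form, and all names are assumptions for illustration and are not the Cross-Domain Global Reachability Planner itself.

```python
import heapq
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def plan(risk: List[List[float]], start: Cell, goal: Cell, weight: float) -> List[Cell]:
    """Dijkstra on a 4-connected grid; entering a cell costs 1 + weight * risk.

    Assumes the goal is reachable from the start (true for the toy grid below).
    """
    rows, cols = len(risk), len(risk[0])
    best: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(0.0, start)]
    while frontier:
        cost, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + 1.0 + weight * risk[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (new_cost, (nr, nc)))
    # Walk parents back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Hypothetical risk map: a risky strip lies directly between start and goal.
risk = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
short = plan(risk, (1, 0), (1, 4), weight=0.1)  # cuts through the risky strip
safe = plan(risk, (1, 0), (1, 4), weight=2.0)   # detours around it
print(len(short) - 1, len(safe) - 1)
```

In this toy setting the small weight yields the 4-step path through the risky cells, while the large weight yields a 6-step detour with zero accumulated risk, mirroring the shorter-but-riskier versus longer-but-safer behavior described above.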
Result Analysis: From Table 6 and Table 7, CD-HSSRL achieves the best balance between switching stability, collision avoidance, and energy efficiency at intermediate values of the switching regularization coefficient and the safety projection penalty coefficient. An excessively small switching regularization coefficient leads to frequent mode oscillations, while overly large values reduce responsiveness near transition regions. Similarly, insufficient safety penalties increase collision risk, whereas excessively large penalty coefficients result in overly conservative behaviors and increased energy consumption.
Table 8 further demonstrates that the proposed framework is moderately sensitive to the cost-map weighting coefficient. When the coefficient is too small, the agent tends to prioritize shorter paths while neglecting terrain risks, leading to higher collision rates and unstable cross-domain transitions. Conversely, excessively large values encourage overly conservative navigation behaviors, resulting in longer trajectories and reduced navigation efficiency. The best overall trade-off is achieved at an intermediate value within the tested range, which balances terrain awareness, safety, and path efficiency.
Figure 13 shows that CTS remains relatively stable across a broad range of option termination threshold values, indicating that CD-HSSRL is not overly sensitive to precise tuning of this threshold.
Overall, the parameter sensitivity analysis indicates that CD-HSSRL maintains stable performance across a wide range of hyperparameter configurations, which supports the robustness of the proposed framework to hyperparameter choice and eases reproduction of the reported results.
5.6. Computational Cost and Scalability
The proposed CD-HSSRL framework introduces additional computational complexity compared to conventional single-policy reinforcement learning methods due to its hierarchical structure and modular components.
From an inference perspective, the framework consists of a high-level policy, multiple low-level controllers, and a safety projection module. However, these components operate at different temporal scales. The high-level policy is executed at a lower frequency to select sub-tasks or domains, while the low-level controller generates control commands at a higher frequency. As a result, the additional computational overhead during execution remains manageable for real-time applications.
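The two-timescale execution described here can be sketched as a single control loop in which the high-level policy is queried only every few steps. The loop structure, the period of 10 steps, the placeholder policies, and the clipping-based safety filter are all assumptions for illustration; in particular, clipping is only a stand-in for the safety projection module.

```python
from typing import Callable, List, Tuple


def run_hierarchy(
    n_steps: int,
    high_level_period: int,
    high_level: Callable[[int], str],
    low_level: Callable[[str, int], float],
    safety_filter: Callable[[float], float],
) -> Tuple[int, List[float]]:
    """Run the low-level controller every step and the high-level policy
    once every `high_level_period` steps; return (high-level calls, commands)."""
    option = ""
    high_level_calls = 0
    commands: List[float] = []
    for t in range(n_steps):
        if t % high_level_period == 0:
            option = high_level(t)  # slow timescale: pick the active domain/option
            high_level_calls += 1
        raw = low_level(option, t)  # fast timescale: raw control command
        commands.append(safety_filter(raw))
    return high_level_calls, commands


# Placeholder components (hypothetical, for illustration only).
def pick_domain(t: int) -> str:
    return "water" if t < 50 else "land"


def controller(option: str, t: int) -> float:
    return 2.0 if option == "water" else 0.5


def clip_command(u: float) -> float:
    return max(-1.0, min(1.0, u))


calls, cmds = run_hierarchy(100, 10, pick_domain, controller, clip_command)
print(calls, len(cmds))
```

Here the high-level policy runs 10 times over 100 control steps, so its cost is amortized over the faster low-level loop, which is the basis for the claim that execution overhead stays manageable.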
In terms of training cost, the framework requires training multiple policies, which increases the total training time and computational resources. Nevertheless, this design improves learning efficiency in complex cross-domain environments by decomposing the task into more manageable sub-problems, leading to more stable convergence.
Regarding scalability, the modular architecture of the framework facilitates extension to more complex or multi-domain scenarios. New domains can be incorporated by introducing additional domain-specific policies without fundamentally modifying the overall structure. However, this scalability is partly constrained by the need for domain-specific knowledge, such as cost maps or environment annotations.
Overall, the proposed framework represents a trade-off between computational cost and performance, prioritizing robustness and adaptability in heterogeneous environments.
5.7. Discussion of Findings and Limitations
This study proposed CD-HSSRL, a Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework for autonomous amphibious robot navigation. Comprehensive experiments demonstrated that CD-HSSRL consistently outperforms representative baselines in water-domain navigation, land-domain planning, and water–land transition tasks. The results indicate that the Cross-Domain Global Reachability Planner effectively unifies heterogeneous environmental cost representations, the Hierarchical Safe Switching Policy enables stable medium-transition decisions, and the Safety-Constrained Continuous Controller effectively reduces collision risks during complex shoreline interactions.
Beyond overall performance gains, comparative experiments against recent high-quality baselines provide deeper insights. Safety-filtering methods such as BarrierNet achieve strong collision avoidance performance, yet they exhibit conservative behaviors and reduced transition efficiency. Hierarchical planning approaches such as pH-DRL and structured planning–learning methods such as MP-DQL demonstrate improved long-horizon decision-making, but they still suffer from unstable medium switching under discontinuous water–land dynamics. By jointly optimizing global reachability planning, hierarchical switching, and safety-constrained control, CD-HSSRL addresses these limitations and achieves a better balance between safety, stability, and navigation efficiency.
The experimental observations further suggest that explicitly modeling medium-switching stability is crucial for discontinuous cross-domain dynamics, where flat or purely hierarchical policies commonly suffer from oscillatory decisions near boundary regions. Moreover, integrating differentiable safety projection into continuous control not only improves collision avoidance but also enhances policy generalization under environmental uncertainties. These findings imply that hierarchical decision decomposition combined with constraint-aware control constitutes a promising paradigm for cross-domain robotic navigation beyond amphibious scenarios.
Despite the promising performance of the proposed CD-HSSRL framework, several limitations should be acknowledged. First, the current validation is conducted entirely in a simulation environment based on Gazebo and the UUV Simulator. Although the simulator incorporates hydrodynamic effects and provides a controllable and reproducible testing platform, it cannot fully capture the complexity and uncertainty of real-world amphibious environments. Factors such as unmodeled disturbances, sensor imperfections, and hardware constraints may affect real-world performance. Future work will focus on transferring the proposed framework to physical platforms and investigating sim-to-real adaptation strategies. Second, the proposed framework relies on several manually designed components, including cost maps and explicit domain labels (e.g., water, land, and transition regions). While these elements improve interpretability and control, they limit the level of autonomy and may reduce generalization to unseen environments where such prior knowledge is unavailable or inaccurate. Third, the method is primarily designed as an engineering-oriented system integration and does not provide formal theoretical guarantees regarding convergence, safety, or switching stability. Although empirical results demonstrate improved performance, a rigorous theoretical analysis would further strengthen the robustness of the framework. Fourth, the hierarchical structure and safety mechanisms introduce additional computational overhead compared to single-policy reinforcement learning approaches. This may limit real-time applicability in resource-constrained systems, particularly for high-frequency control tasks. Finally, the dataset-to-environment conversion process may introduce bias due to simplifications and assumptions made during mapping. Differences between the generated simulation scenarios and real-world environments may affect the generalization capability of the learned policies. 
Addressing these limitations constitutes an important direction for future research, including improving environment realism, reducing reliance on manual design, enhancing computational efficiency, and validating the framework in real-world deployments.
6. Conclusions
This paper investigated the problem of autonomous cross-domain navigation for amphibious robotic systems operating in heterogeneous water–land environments. The main research question addressed in this work is whether a hierarchical reinforcement learning framework with adaptive switching and safety-aware control can improve navigation stability and robustness under discontinuous dynamics. To address this problem, we proposed the CD-HSSRL framework, which integrates hierarchical decision-making, safety projection, and adaptive switching mechanisms into a unified navigation architecture.
The experimental results in the Gazebo + UUV simulation environment demonstrate that the proposed method outperforms the baseline approaches, with higher success rates and lower collision rates across water, land, and transition environments. In particular, in cross-domain scenarios, the proposed method improves the success rate by approximately 20% compared to conventional RL methods while maintaining stable performance under environmental disturbances. These results indicate that the proposed framework is effective for handling heterogeneous dynamics and complex navigation tasks.
However, the current study is limited to simulation-based validation. Future work will focus on real-world experiments and on integrating sim-to-real transfer techniques into the proposed framework.