CD-HSSRL: Cross-Domain Hierarchical Safe Switching Reinforcement Learning Framework for Autonomous Amphibious Robot Navigation

Liu, Shuang; Wei, Lei; Li, Xiaoqing

doi:10.3390/jmse14090859


by Shuang Liu, Lei Wei and Xiaoqing Li *

School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(9), 859; https://doi.org/10.3390/jmse14090859

Submission received: 2 April 2026 / Revised: 29 April 2026 / Accepted: 30 April 2026 / Published: 3 May 2026

(This article belongs to the Special Issue Advances in Modelling, Navigation, and Intelligent Control of Marine Vehicles and Robotics)


Abstract

Autonomous tracked amphibious robotic systems operating across water and land environments are essential for coastal inspection, disaster response, environmental monitoring, and complex terrain exploration. However, discontinuous water–land dynamics, unstable medium switching, and safety-critical control under environmental uncertainty pose significant challenges to existing amphibious navigation and path planning methods, where global reachability and adaptive decision-making are difficult to unify. Motivated by these challenges, this paper proposes CD-HSSRL, a Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework for autonomous tracked amphibious navigation. Specifically, a Cross-Domain Global Reachability Planner is developed to construct unified cost representations across heterogeneous water–land environments, a Hierarchical Safe Switching Policy enables stable medium-transition decision-making through option-based policy decomposition with switching regularization, and a Safety-Constrained Continuous Controller integrates action safety projection and risk-sensitive reward shaping to ensure collision-free control during complex shoreline interactions. These components are jointly optimized to achieve robust cross-domain navigation. The experimental results in the Gazebo + UUV simulation environment show that the proposed method demonstrates competitive performance compared with baseline approaches, achieving higher success rates and lower collision rates across water, land, and transition environments. In particular, in cross-domain scenarios, the proposed method improves success rates by approximately 20% compared to conventional RL methods while maintaining stable performance under environmental disturbances. Robustness and ablation studies further verify the effectiveness of hierarchical switching and safety-constrained control mechanisms. 
Overall, this work establishes an integrated framework for safe and robust cross-domain navigation of tracked amphibious robotic systems, providing new insights into hierarchical safe-switching architectures for multi-medium autonomous robots.

Keywords:

tracked amphibious robots; cross-domain navigation; reinforcement learning; hierarchical safe switching; safety-constrained control

1. Introduction

Autonomous tracked amphibious robotic systems capable of operating seamlessly across water and land environments play an increasingly important role in coastal inspection [1], environmental monitoring [2], disaster rescue [3], and maritime transportation applications [4]. Compared with single-medium robotic systems, amphibious platforms provide superior mission flexibility and accessibility in complex terrains where water and land coexist [5]. However, enabling robots to autonomously navigate across heterogeneous environments remains a fundamental challenge [6], as water–land transitions involve discontinuous dynamics [7], rapidly changing environmental constraints [8], and safety-critical interactions with uncertain surroundings [9].

Recent advances in learning-based robotic navigation have demonstrated remarkable success in single-domain path planning and obstacle avoidance for unmanned surface vehicles and ground robots [10]. Nevertheless, most existing approaches are designed for either water or land environments independently [11], and their policies often fail when directly transferred across domains due to inconsistent state representations, abrupt medium switching, and unmodeled physical constraints [12]. Consequently, current methods suffer from unstable transition decisions near shorelines, oscillatory behaviors during medium switching, and elevated collision risks, which significantly limit the real-world deployment of amphibious robotic systems.

To address these limitations, this study investigates the following research question: How can an autonomous robot achieve safe, stable, and efficient navigation across discontinuous water–land environments under environmental uncertainty? We hypothesize that explicitly modeling cross-domain reachability, hierarchical switching decisions, and safety-constrained control is essential to achieve robust amphibious navigation.

The objective of this work is to develop a framework that integrates global cross-domain planning, medium-switching decision-making, and safety-aware continuous control into a coherent joint optimization scheme. Solving this problem involves several critical challenges. First, water and land environments exhibit fundamentally different dynamic constraints, making it difficult to construct unified environmental representations for global planning. Second, naive policy structures struggle to produce stable medium-switching decisions near boundary regions, leading to frequent oscillations and control instability. Third, safety-critical constraints during shoreline interactions require explicit collision and grounding avoidance mechanisms beyond standard learning formulations.

Motivated by these challenges, we propose CD-HSSRL, a Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework for autonomous amphibious navigation. The proposed framework introduces a Cross-Domain Global Reachability Planner to construct unified cost-aware environmental representations, enabling consistent long-horizon planning across water and land. A Hierarchical Safe Switching Policy is designed to decompose navigation into high-level medium-switching decisions and low-level motion control, enforcing switching stability through regularized option learning. Furthermore, a Safety-Constrained Continuous Controller integrates action safety projection and risk-sensitive reward shaping to guarantee collision-free and stable control during complex water–land transitions. These modules are jointly optimized to achieve unified planning–switching–safety co-optimization for robust cross-domain navigation.

The main contributions of this paper are summarized as follows: (1) We propose a novel Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework that unifies water–land navigation, medium-switching decision-making, and safety-critical control into a jointly optimized hierarchical architecture. (2) We develop a Cross-Domain Global Reachability Planner and a Hierarchical Safe Switching Policy that enable stable and robust amphibious navigation under discontinuous environmental dynamics. (3) We design a Safety-Constrained Continuous Controller that explicitly enforces physical safety constraints during shoreline interaction. (4) Extensive experiments on multiple water-domain, land-domain, and cross-domain benchmarks demonstrate that the proposed method achieves competitive performance compared with baselines in navigation success rate, transition stability, and collision avoidance performance.

2. Related Work

2.1. Amphibious and Cross-Domain Robot Navigation

Amphibious and cross-domain robotic systems have emerged as an important research topic due to their ability to operate in heterogeneous environments where water and land coexist [13,14]. Typical application scenarios include coastal inspection, flood rescue [15], ecological monitoring, and military reconnaissance [16]. Compared with single-medium robotic platforms, amphibious robots offer greater mission flexibility but also face fundamentally different environmental constraints when transitioning between water and land. Early studies on amphibious navigation mainly relied on model-based planning frameworks [17], such as graph search, sampling-based path planning, and optimization-based trajectory generation. These approaches usually construct separate environmental models for water and land and design heuristic cost functions to guide navigation. While such methods achieve acceptable performance in structured environments, they heavily depend on accurate dynamic modeling and manually designed traversability maps, which limits their adaptability in complex and uncertain real-world scenes.

More recent research has introduced learning-based strategies to enhance amphibious navigation performance [18]. Learning-based planners can capture nonlinear hydrodynamic effects and terrain interactions more effectively than traditional analytical models [19]. However, most existing works still treat water-domain and land-domain planning as two loosely coupled processes, and medium transitions are often handled by handcrafted switching logic or predefined thresholds. This separation leads to inconsistent decision-making near shoreline boundaries, unstable transitions, and reduced robustness under environmental disturbances. Therefore, a unified planning and decision-making framework capable of representing heterogeneous environments and enabling smooth cross-domain transitions remains an open research challenge.

2.2. Reinforcement Learning for Water-Domain and Land-Domain Navigation

Reinforcement learning has become a powerful tool for autonomous navigation in both water and land environments [20,21]. Among these methods, widely adopted algorithms such as Proximal Policy Optimization (PPO) and Soft Actor–Critic (SAC) have demonstrated strong performance in continuous control problems. In water-domain navigation, deep reinforcement learning has been widely applied to amphibious and surface robotic platforms for dynamic obstacle avoidance, collision-free control, and energy-efficient path planning. Li et al. [22] developed an RL-based path planning framework for autonomous underwater vehicles (AUVs), showing improved adaptability in dynamic ocean environments. Mou et al. [23] proposed a reinforcement learning-based navigation strategy for unmanned surface vehicles, emphasizing safety and trajectory optimization. These methods benefit from end-to-end policy learning and can adapt to complex maritime environments without explicit hydrodynamic modeling. Similarly, in land-domain navigation, reinforcement learning has demonstrated strong capabilities in mobile robot path planning, multi-agent coordination, and navigation in dynamic scenes. Tao et al. [24] proposed Adaptive Soft Actor–Critic (ASAC), which combines the SAC algorithm, tile coding, and the Dynamic Window Approach (DWA) to enhance path planning capabilities. Such approaches allow robots to learn reactive and anticipatory behaviors directly from environmental interactions.

Despite these advances, most existing reinforcement learning methods are developed and trained in a single domain. Policies learned in water environments usually fail when directly transferred to land environments, and vice versa, due to inconsistent observation distributions, abrupt changes in motion dynamics, and different safety constraints. As a result, existing single-domain reinforcement learning frameworks struggle to generalize across heterogeneous environments and cannot guarantee stable decision-making during water–land transitions. This limitation highlights the necessity of developing cross-domain reinforcement learning frameworks that explicitly model medium-dependent dynamics and enable knowledge sharing between water and land navigation policies.

2.3. Hierarchical Reinforcement Learning and Medium-Switching Decision Making

Hierarchical reinforcement learning decomposes complex decision-making tasks into multiple temporal or functional layers [25], typically including a high-level planner and a low-level controller. Such hierarchical structures improve learning efficiency, interpretability, and scalability, especially in long-horizon robotic navigation problems. Option-based frameworks further enable the learning of temporally extended actions, allowing robots to switch between different behavioral modes based on environmental contexts.

Hierarchical reinforcement learning has been successfully applied to task decomposition, navigation subtasks, and skill sequencing in robotics. However, existing hierarchical approaches primarily focus on abstract task or goal decomposition and seldom consider physical medium-switching in real-world robotic systems. In amphibious navigation, medium-switching decisions correspond to physically distinct motion regimes, such as floating, shoreline climbing, and ground driving. Without explicit switching stability modeling, hierarchical policies often generate oscillatory decisions near transition regions, resulting in inefficient control and increased risk of grounding or collision. This reveals the need for a hierarchical reinforcement learning framework specifically designed to handle physical medium transitions and enforce stable switching behavior in cross-domain navigation.

Recent hierarchical navigation frameworks such as predictive hierarchical deep reinforcement learning (pH-DRL [26]) and motion-primitive-based deep Q-learning (MP-DQL [27]) further demonstrate the effectiveness of long-horizon decision decomposition and structured action spaces in robotic planning. However, these approaches are mainly designed for single-domain navigation and do not explicitly model physical medium transitions or switching stability in amphibious systems.

2.4. Safe Reinforcement Learning and Constraint-Aware Robotic Control

Safety is a critical requirement for autonomous robots operating in real environments. To address safety concerns, safe reinforcement learning techniques have been proposed to incorporate physical constraints and risk-awareness into policy learning. Common strategies include action projection layers that filter unsafe control commands, constraint-aware optimization objectives, and risk-sensitive reward formulations that penalize unsafe behaviors. These methods effectively improve collision avoidance and system robustness in single-domain navigation tasks.

However, most existing safe reinforcement learning frameworks focus on either water-domain or land-domain safety constraints independently. In cross-domain amphibious navigation, safety risks are amplified during medium transitions, such as shoreline climbing, water entry, and obstacle interaction at boundary regions. Existing safe control strategies are rarely integrated with medium-switching decision policies or global cross-domain planners, leading to fragmented safety handling mechanisms. Consequently, guaranteeing safety throughout the entire water–land transition process remains challenging. This motivates the development of a single learning architecture that jointly considers safety constraints, medium-switching decisions, and cross-domain planning.

Representative safety-filtering frameworks such as BarrierNet [28] introduce differentiable control barrier functions to enforce safety constraints during policy execution. While these methods provide strong collision avoidance guarantees, they are not integrated with hierarchical medium-switching decision policies or cross-domain global planners, limiting their applicability in discontinuous water–land navigation.

Some recent studies have focused on sim-to-real transfer in reinforcement learning, aiming to bridge the gap between simulation and real-world deployment. These approaches typically employ domain randomization, adaptation learning, or hierarchical training strategies [29,30]. However, most existing sim-to-real methods assume a single-domain setting with consistent dynamics, and they do not explicitly address cross-domain navigation involving discontinuous transitions, such as water–land scenarios.

Overall, prior research has made significant progress in amphibious navigation, single-domain reinforcement learning, hierarchical decision-making, and safe control. Nevertheless, a unified approach that simultaneously addresses cross-domain environmental representation, stable medium-switching decisions, and safety-constrained continuous control is still lacking. To bridge this gap, this paper proposes CD-HSSRL, a Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework that integrates global reachability planning, hierarchical switching policies, and safety-constrained control to achieve robust autonomous amphibious navigation.

3. Method

In this section, we present the proposed Cross-Domain Hierarchical Safe Switching Reinforcement Learning (CD-HSSRL) framework for autonomous navigation and path planning of amphibious robots across water–land environments. We first formulate the cross-domain navigation problem and then introduce the overall hierarchical architecture, followed by the detailed design of each functional module.

3.1. Problem Formulation

We consider an amphibious robot operating in a mixed water–land environment. The environment is represented by a cross-domain state space $S = S_{w} \cup S_{l} \cup S_{t}$, where $S_{w}$, $S_{l}$, and $S_{t}$ denote water, land, and transition regions, respectively. The robot dynamics vary across domains, leading to discontinuous motion models.

The navigation objective is to find a policy $\pi(a \mid s)$ that drives the robot from a start state $s_{0}$ to a goal state $s_{g}$ while minimizing cumulative cost and satisfying safety constraints. This problem is formulated as a constrained Markov decision process (CMDP):

$$M = \langle S, A, P, R, C, \gamma \rangle, \tag{1}$$

where $A$ is the action space, $P(s' \mid s, a)$ is the transition probability, $R(s, a)$ is the reward function, $C(s, a)$ denotes constraint cost, and $\gamma \in (0, 1)$ is the discount factor.

The optimization objective is

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} R(s_{t}, a_{t})\right], \tag{2}$$

subject to the safety constraint

$$\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma_{c}^{t} C(s_{t}, a_{t})\right] \leq d, \tag{3}$$

where $d$ is a predefined safety threshold limiting collision, grounding, or rule-violation risks, and $\gamma_{c}$ denotes the discount factor for accumulated safety costs.
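To make Equations (2) and (3) concrete, the following minimal sketch evaluates the discounted return and the discounted safety cost of a single finite episode and checks the cost against the threshold $d$. The helper names `discounted_sum` and `satisfies_constraint` and the numbers are illustrative assumptions; the expectation over episodes in the CMDP formulation is replaced here by one sample trajectory.

```python
def discounted_sum(values, gamma):
    """Return sum_t gamma**t * values[t] for one finite episode."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total


def satisfies_constraint(costs, gamma_c, d):
    """Single-episode analogue of constraint (3): discounted cost <= d."""
    return discounted_sum(costs, gamma_c) <= d


# Hypothetical three-step episode with one unit of safety cost at step 1.
rewards = [1.0, 1.0, 1.0]
costs = [0.0, 1.0, 0.0]
episode_return = discounted_sum(rewards, 0.5)  # 1 + 0.5 + 0.25
episode_cost = discounted_sum(costs, 0.5)      # 0.5 * 1
```

With these numbers the episode is feasible for any threshold $d \geq 0.5$ and infeasible below it.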

The key challenge lies in simultaneously handling (1) discontinuous dynamics across water–land domains, (2) long-horizon global reachability under heterogeneous environmental costs, and (3) safety-critical control during medium transitions.

3.2. Platform Description

The target platform used in this study is a tracked-thruster amphibious robot designed specifically for operation in both land and underwater environments. As shown in Figure 1, the robot is equipped with a tracked land mobile unit and a thruster-type underwater propulsion unit. This hybrid configuration allows the robot to traverse complex shoreline environments while maintaining maneuverability in shallow and deep water.

The onboard sensing suite consists of a Global Positioning System (GPS), an Inertial Measurement Unit (IMU), a 3D lidar sensor, an ultrasonic sensor, and a water depth sensor. These sensors provide complementary information that supports both navigation and motion control.

3.3. Overall Framework of CD-HSSRL

To address the above challenges, we propose CD-HSSRL (Cross-Domain Hierarchical Safe-Switching Reinforcement Learning), a hierarchical planning–learning architecture for autonomous amphibious robot navigation across water–land environments. The framework decomposes the amphibious navigation problem into three cooperative layers, enabling structured decision-making from long-horizon planning to low-level safe control.

Formally, the overall navigation policy is factorized as

$$\pi(a \mid s) = \pi_{L}(a \mid s, o; \theta_{L}) \, \pi_{H}(o \mid s; \theta_{H}), \tag{4}$$

where $\pi_{H}(o \mid s; \theta_{H})$ denotes a high-level switching policy that selects a domain-specific motion option $o \in \{\text{water}, \text{transition}, \text{land}\}$, and $\pi_{L}(a \mid s, o; \theta_{L})$ denotes a low-level continuous control policy that generates executable control actions conditioned on the selected option.
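As a rough illustration of the factorization in Equation (4), the sketch below first draws an option from a categorical high-level distribution and then draws a scalar action from an option-conditioned Gaussian. The function name `sample_hierarchical`, the probability table, the per-option means, and the Gaussian form are all assumptions made for illustration; in the paper both policies are neural networks.

```python
import random


def sample_hierarchical(option_probs, option_means, std, rng):
    """Draw o ~ pi_H(. | s), then a ~ pi_L(. | s, o) (toy 1-D Gaussian)."""
    options = list(option_probs)
    weights = [option_probs[o] for o in options]
    option = rng.choices(options, weights=weights, k=1)[0]
    action = rng.gauss(option_means[option], std)
    return option, action


rng = random.Random(0)
probs = {"water": 0.7, "transition": 0.2, "land": 0.1}  # hypothetical pi_H output
means = {"water": 1.0, "transition": 0.3, "land": 0.6}  # hypothetical pi_L means
option, action = sample_hierarchical(probs, means, 0.1, rng)
```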

The CD-HSSRL framework consists of three major components.

First, the Cross-Domain Global Reachability Planner constructs a unified cost-aware representation of the water–land environment and generates a global waypoint sequence that guarantees long-horizon reachability while avoiding risky regions such as shallow waters, steep shorelines, and high-friction terrains.

Second, the Hierarchical Safe Switching Policy learns when and where to switch between water, transition, and land motion modes. This high-level policy integrates global waypoint guidance and current state observations to produce stable and consistent medium-switching decisions under discontinuous cross-domain dynamics.

Third, the Safety-Constrained Continuous Controller produces smooth and safe continuous control actions under physical and rule-based constraints. A safety projection layer filters raw actions to satisfy collision avoidance, shoreline stability, and maritime rule compliance, while a risk-sensitive reward formulation further encourages safe navigation behaviors.

By jointly optimizing the high-level switching policy and the low-level controller, the proposed framework achieves coordinated cross-domain decision-making and safety-aware motion control.

The overall architecture of the proposed CD-HSSRL framework is illustrated in Figure 2.

3.4. Cross-Domain Global Reachability Planner

The Cross-Domain Global Reachability Planner (CD-GRP) is designed to generate a long-horizon feasible navigation skeleton that guarantees reachability across heterogeneous water–land environments while avoiding high-risk regions. Unlike conventional global planners that operate on single-terrain maps, CD-GRP constructs a unified cost-aware representation integrating water depth, shoreline slope, land traction, and obstacle distributions.

Specifically, four domain-dependent cost layers are first constructed:

$$D(x, y): \text{water depth risk cost}, \tag{5}$$

$$S(x, y): \text{shoreline slope transition cost}, \tag{6}$$

$$F(x, y): \text{terrain friction cost}, \tag{7}$$

$$O(x, y): \text{obstacle occupancy cost}. \tag{8}$$

These cost layers are fused into a unified cross-domain cost map:

$$G(x, y) = \alpha D(x, y) + \beta S(x, y) + \delta F(x, y) + \eta O(x, y), \tag{9}$$

where $\alpha$, $\beta$, $\delta$, and $\eta$ are weighting coefficients balancing safety and traversability considerations.

The weighting coefficients control the relative importance of different navigation objectives, including obstacle avoidance, terrain traversability, and transition stability. In this work, the coefficients are initialized using heuristic prior knowledge and subsequently adjusted through empirical validation experiments. Specifically, a higher water-depth weight encourages land movement; a higher transition-related weight improves switching smoothness near water–land boundaries; a higher terrain-cost weight discourages traversal through unstable or shallow regions; and a higher obstacle weight encourages safer navigation behavior.

The final parameter values are selected based on performance trade-offs observed during validation experiments.
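The fusion in Equation (9) is a cell-wise weighted sum, which can be sketched as follows. The function name `fuse_cost_layers`, the 2×2 layers, and the weight values are hypothetical and chosen only so that each layer's contribution is easy to read off; they are not the coefficients used in the experiments.

```python
def fuse_cost_layers(depth, slope, friction, obstacle, alpha, beta, delta, eta):
    """Weighted sum of four equally sized 2-D cost layers, as in Eq. (9)."""
    rows, cols = len(depth), len(depth[0])
    return [
        [
            alpha * depth[r][c]
            + beta * slope[r][c]
            + delta * friction[r][c]
            + eta * obstacle[r][c]
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical layers: each one is active in exactly one cell.
D = [[1.0, 0.0], [0.0, 0.0]]
S = [[0.0, 1.0], [0.0, 0.0]]
F = [[0.0, 0.0], [1.0, 0.0]]
O = [[0.0, 0.0], [0.0, 1.0]]
G = fuse_cost_layers(D, S, F, O, alpha=1.0, beta=2.0, delta=3.0, eta=4.0)
```

Because each layer is nonzero in a different cell, the fused map simply reproduces the four weights, which makes the effect of raising any single coefficient visible at a glance.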

Based on the unified cost map $G(x, y)$, an incremental global path search is performed to obtain an optimal reachability path:

$$P^{*} = \arg\min_{P} \sum_{(x, y) \in P} G(x, y), \tag{10}$$

with heuristic-guided evaluation:

$$f(n) = g(n) + h(n), \tag{11}$$

where $g(n)$ denotes the accumulated cost from the start node to node $n$, and $h(n)$ is the heuristic distance estimate to the goal. The incremental search mechanism enables efficient replanning under dynamic environmental updates.
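The evaluation rule $f(n) = g(n) + h(n)$ can be illustrated with a plain A* search over a small fused cost grid. This is a simplified stand-in: it is not the incremental variant described above, and the 4-connected grid, the Manhattan heuristic, and the function name `astar` are assumptions made for the example. The start cell's own cost is not counted, which shifts every path cost by the same constant and does not change the minimizer of Equation (10).

```python
import heapq


def astar(cost, start, goal):
    """A* over a 4-connected grid; entering cell (r, c) costs cost[r][c]."""
    rows, cols = len(cost), len(cost[0])

    def h(node):
        # Manhattan distance; admissible when every cell cost is >= 1.
        return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

    g = {start: 0.0}
    parent = {start: None}
    frontier = [(h(start), start)]  # entries are (f = g + h, node)
    closed = set()
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g[goal]
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                new_g = g[node] + cost[nxt[0]][nxt[1]]
                if new_g < g.get(nxt, float("inf")):
                    g[nxt] = new_g
                    parent[nxt] = node
                    heapq.heappush(frontier, (new_g + h(nxt), nxt))
    return None, float("inf")


# Hypothetical 3x3 fused cost map with an expensive centre column.
G = [
    [1.0, 9.0, 1.0],
    [1.0, 9.0, 1.0],
    [1.0, 1.0, 1.0],
]
path, total = astar(G, (0, 0), (0, 2))
```

In this toy map the direct route through the high-cost column is more expensive than the detour along the bottom row, so the search returns the detour.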

The final output of CD-GRP is a global waypoint sequence:

$$W = \{w_{1}, w_{2}, \dots, w_{K}\}, \tag{12}$$

which provides high-level guidance for the subsequent hierarchical switching policy.

The overall cost-map fusion and incremental reachability planning process of CD-GRP is illustrated in Figure 3.

3.5. Hierarchical Safe Switching Policy

Due to the discontinuous dynamics between water and land motion, directly learning a monolithic policy often leads to unstable behaviors during medium transitions. To address this issue, we propose a Hierarchical Safe Switching Policy (HSSP), which learns to select appropriate domain-specific motion modes while maintaining switching stability.

At each decision step, the high-level switching policy selects an option:

$$o_{t} \sim \pi_{H}(o \mid s_{t}; \theta_{H}), \quad o_{t} \in \{\text{water}, \text{transition}, \text{land}\}, \tag{13}$$

where $\pi_{H}(o \mid s_{t}; \theta_{H})$ is a neural policy network that takes the current state $s_{t}$ and global waypoint guidance $W$ as input.

Once an option is selected, it remains active until a termination condition is satisfied:

$$\beta(o_{t} \mid s_{t}) = \begin{cases} 1, & \text{if } s_{t} \in S_{t} \text{ or waypoint reached}, \\ 0, & \text{otherwise}. \end{cases} \tag{14}$$
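The termination rule in Equation (14) can be sketched as a small option-execution loop. Read literally, the rule re-opens the option choice at every step spent inside the transition region and whenever a waypoint is reached. The names `option_terminates` and `rollout_options`, the boolean flags, and the scripted option selector are hypothetical stand-ins for the learned high-level policy.

```python
def option_terminates(in_transition_region, waypoint_reached):
    """beta(o_t | s_t) from Eq. (14): 1 ends the active option, 0 keeps it."""
    return 1 if (in_transition_region or waypoint_reached) else 0


def rollout_options(flags, choose_option):
    """Record the active option per step, re-selecting after termination."""
    active = choose_option(0)
    history = []
    for t, (in_transition, reached) in enumerate(flags):
        history.append(active)
        if option_terminates(in_transition, reached):
            active = choose_option(t + 1)
    return history


def scripted_choice(t):
    """Hypothetical selector standing in for pi_H."""
    if t < 3:
        return "water"
    return "transition" if t < 4 else "land"


# (in_transition_region, waypoint_reached) for five consecutive steps.
flags = [(False, False), (False, False), (True, False), (False, True), (False, False)]
history = rollout_options(flags, scripted_choice)
```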

To discourage unnecessary frequent medium switching, a switching regularization loss is introduced:

$$L_{sw} = \lambda_{sw} \left\lVert \pi_{H}(\cdot \mid s_{t}) - \pi_{H}(\cdot \mid s_{t-1}) \right\rVert_{2}^{2}, \tag{15}$$

where $\lambda_{sw}$ controls the stability penalty strength, which softly penalizes abrupt changes in option distributions to ensure stable and smooth medium-switching behaviors.

The switching regularization term is introduced to balance transition responsiveness and policy stability during cross-domain navigation. Without regularization, the agent may frequently oscillate between navigation modes near ambiguous transition boundaries, resulting in unstable trajectories and increased collision risk. Conversely, excessively strong regularization suppresses switching behavior and may delay necessary adaptation when environmental conditions change rapidly. Therefore, the switching penalty introduces an explicit trade-off between adaptability and stability, which is particularly important in heterogeneous transition regions where environmental dynamics change discontinuously.
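A small numerical example may help show how the penalty in Equation (15) behaves. The sketch below computes the squared L2 distance between two consecutive option distributions over (water, transition, land); the function name `switching_loss`, the value of $\lambda_{sw}$, and the distributions are illustrative assumptions.

```python
def switching_loss(probs_t, probs_prev, lam_sw):
    """lam_sw * ||pi_H(.|s_t) - pi_H(.|s_{t-1})||_2^2, as in Eq. (15)."""
    return lam_sw * sum((p - q) ** 2 for p, q in zip(probs_t, probs_prev))


# Option order: (water, transition, land). All numbers are hypothetical.
stable = switching_loss([0.8, 0.1, 0.1], [0.8, 0.1, 0.1], 0.5)
abrupt = switching_loss([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], 0.5)
```

An unchanged distribution incurs no penalty, while a full flip from water to land gives the largest possible squared distance for two probability vectors (2), scaled by $\lambda_{sw}$.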

The high-level switching policy is optimized using a clipped PPO objective:

$$L_{H}(\theta_{H}) = \mathbb{E}_{t}\left[\min\left(r_{t}(\theta_{H}) \hat{A}_{t}, \; \operatorname{clip}\left(r_{t}(\theta_{H}), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_{t}\right)\right], \tag{16}$$

where $r_{t}(\theta_{H}) = \frac{\pi_{H}(o_{t} \mid s_{t}; \theta_{H})}{\pi_{H}(o_{t} \mid s_{t}; \theta_{H}^{old})}$ and $\hat{A}_{t}$ is the advantage estimate.
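The clipping in Equation (16) can be checked by hand on a two-sample batch. In the sketch below, the function names, the clip range $\epsilon = 0.25$, and the ratio and advantage values are assumptions chosen so that the arithmetic is exact; the expression is the surrogate that PPO maximizes.

```python
def clipped_surrogate(ratio, advantage, eps):
    """Per-sample term inside the expectation of Eq. (16)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(ratios, advantages, eps):
    """Batch average of the clipped surrogate."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# Hypothetical batch: a large ratio with positive advantage,
# and a small ratio with negative advantage.
batch_value = ppo_objective([1.5, 0.5], [2.0, -1.0], 0.25)
```

For the first sample the ratio 1.5 is clipped to 1.25, so the term is 2.5 rather than 3.0. For the second, the ratio 0.5 is clipped to 0.75 and the minimum selects the more pessimistic value of -0.75.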

The execution loop and optimization flow of the proposed HSSP are illustrated in Figure 4.

3.6. Safety-Constrained Continuous Controller

While the high-level policy determines the motion mode, the low-level controller must generate continuous control actions that are dynamically feasible and safe in real time. We therefore design a Safety-Constrained Continuous Controller (SCCC) that integrates stochastic policy learning with explicit safety constraint enforcement. In particular, the safety constraints explicitly encode collision avoidance and shoreline grounding prevention, which are critical failure modes during water–land medium transitions.

The low-level control policy outputs a raw action:

$$a_{t} \sim \pi_{L}(a \mid s_{t}, o_{t}; \theta_{L}), \tag{17}$$

where $\pi_{L}(a \mid s_{t}, o_{t}; \theta_{L})$ is a stochastic actor network conditioned on the current state and selected option.

Although the low-level policy can generate continuous control commands, the raw action may violate safety requirements in obstacle-dense or shallow-water transition regions. Therefore, before execution, the action is checked against a set of explicitly defined safety constraints and projected into the feasible safe action space when necessary:

$$a_{t}^{safe} = \Pi_{A_{safe}(s_{t})}(a_{t}), \tag{18}$$

where $a_{t}$ denotes the raw action generated by the low-level policy, $a_{t}^{safe}$ is the corrected safe action, and $\Pi_{A_{safe}(s_{t})}(\cdot)$ represents the projection operator onto the state-dependent safe action set $A_{safe}(s_{t})$:

$$A_{safe}(s_{t}) = \{a_{t} \in A \mid g_{obs}(s_{t}, a_{t}) \geq 0, \; g_{ground}(s_{t}, a_{t}) \geq 0, \; g_{dyn}(s_{t}, a_{t}) \geq 0\}, \tag{19}$$

where $g_{obs}$, $g_{ground}$, and $g_{dyn}$ represent obstacle avoidance, grounding prevention, and dynamic feasibility constraints, respectively.

$$g_{obs}(s_{t}, a_{t}) = d(p_{t+1}, O) - d_{min} \geq 0, \tag{20}$$

where $p_{t+1}$ denotes the predicted next position, $O$ represents surrounding obstacles, and $d_{min}$ is the predefined minimum safety distance.

$$g_{ground}(s_{t}, a_{t}) = h(p_{t+1}) - h_{min} \geq 0, \tag{21}$$

where $h(p_{t+1})$ denotes the local water depth or terrain clearance, and $h_{min}$ is the minimum safe operating depth required to avoid grounding.

$$g_{dyn}(s_{t}, a_{t}) = a_{max} - \lVert a_{t} \rVert \geq 0, \tag{22}$$

where $a_{max}$ denotes the maximum allowable control magnitude.

At each control step, the next-state position $p_{t+1}$ is estimated based on the current vehicle state and candidate action. Obstacle distances are computed from the cost map, while terrain elevation and local depth information are extracted from the Gazebo + UUV simulation environment.

If all constraints are satisfied, the original action is directly executed. Otherwise, the action is projected into the feasible safe action set by adjusting its magnitude or direction.
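The check-then-project procedure can be sketched as follows. This is a heuristic stand-in rather than an exact Euclidean projection onto $A_{safe}(s_{t})$: it first rescales the action to satisfy the magnitude bound and then shrinks it in fixed steps until the obstacle and grounding constraints hold. The function name `project_to_safe`, the 2-D velocity-like action, the one-step kinematic prediction `pos + action * dt`, the point obstacles, and the `depth_at` callback are all assumptions introduced for illustration.

```python
import math


def project_to_safe(action, pos, obstacles, depth_at, a_max, d_min, h_min, dt=1.0):
    """Heuristic stand-in for the projection in Eq. (18)."""
    ax, ay = action
    norm = math.hypot(ax, ay)
    if norm > a_max:  # enforce g_dyn by rescaling onto the a_max ball
        ax, ay = ax * a_max / norm, ay * a_max / norm

    def feasible(cx, cy):
        nxt = (pos[0] + cx * dt, pos[1] + cy * dt)  # assumed kinematic prediction
        nearest = min((math.dist(nxt, o) for o in obstacles), default=math.inf)
        g_obs = nearest - d_min           # Eq. (20)
        g_ground = depth_at(nxt) - h_min  # Eq. (21)
        return g_obs >= 0.0 and g_ground >= 0.0

    # Keep the direction and shrink the magnitude until the constraints hold.
    for scale in (1.0, 0.75, 0.5, 0.25, 0.0):
        if feasible(ax * scale, ay * scale):
            return (ax * scale, ay * scale)
    return (0.0, 0.0)  # nothing feasible along this direction: stop


def deep_water(p):
    """Hypothetical depth lookup: uniformly deep water."""
    return 1.0


origin = (0.0, 0.0)
kept = project_to_safe((0.0, 1.0), origin, [(3.0, 0.0)], deep_water, 2.0, 1.0, 0.5)
capped = project_to_safe((4.0, 0.0), origin, [(3.0, 0.0)], deep_water, 2.0, 1.0, 0.5)
slowed = project_to_safe((2.0, 0.0), origin, [(2.0, 0.0)], deep_water, 2.0, 1.0, 0.5)
```

In the three calls, the first action already satisfies every constraint and is returned unchanged, the second is only capped to the magnitude limit, and the third is halved so that the predicted position stays at the minimum distance from the obstacle. A projection that also adjusts direction, as the text allows, would need a small constrained optimization in place of the scaling loop.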

In addition, we introduce a risk-sensitive reward shaping strategy:

$$R_{t}^{safe} = R_{t} - \kappa \, P(\text{collision or grounding} \mid s_{t}, a_{t}), \tag{23}$$

where $\kappa$ is the risk penalty coefficient. This formulation encourages the controller to prioritize safe behaviors while preserving navigation efficiency.
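As a worked instance of Equation (23), the sketch below applies the penalty to the same task reward under two risk estimates. How the probability of collision or grounding is estimated is not specified in this section, so it is simply passed in as a number; the function name `risk_shaped_reward` and the values are hypothetical.

```python
def risk_shaped_reward(reward, p_unsafe, kappa):
    """R_safe = R - kappa * P(collision or grounding | s, a), as in Eq. (23)."""
    if not 0.0 <= p_unsafe <= 1.0:
        raise ValueError("p_unsafe must be a probability in [0, 1]")
    return reward - kappa * p_unsafe


# Hypothetical numbers: identical task reward, low versus high estimated risk.
safe_step = risk_shaped_reward(1.0, 0.0, 2.0)   # no penalty applied
risky_step = risk_shaped_reward(1.0, 0.5, 2.0)  # penalty of 2.0 * 0.5
```

With $\kappa = 2$, a 50% estimated risk cancels the unit task reward entirely, which illustrates how $\kappa$ trades off progress against safety.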

The low-level policy is optimized using the Soft Actor–Critic (SAC) objective:

$$L_{L}(\theta_{L}) = \mathbb{E}_{(s_{t}, a_{t}) \sim D}\left[\alpha \log \pi_{L}(a_{t} \mid s_{t}, o_{t}) - Q_{\theta_{Q}}(s_{t}, a_{t})\right], \tag{24}$$

where $Q_{\theta_{Q}}$ is trained using the standard soft Bellman residual in Soft Actor–Critic.

The overall architecture and optimization loop of SCCC are illustrated in Figure 5.

3.7. Training Objective and Optimization

The overall CD-HSSRL framework is trained by jointly optimizing the high-level switching policy and the low-level controller. The total loss function is defined as

$$L_{total} = L_{H} + L_{L} + L_{sw}, \tag{25}$$

where $L_{H}$ denotes the PPO loss for the medium-switching policy, $L_{L}$ denotes the SAC loss for continuous control, and $L_{sw}$ is the switching regularization term.

Parameters $\theta_{H}$ and $\theta_{L}$ are updated using stochastic gradient descent:

$$\theta \leftarrow \theta - \eta \nabla_{\theta} L_{total}, \tag{26}$$

where $\eta$ is the learning rate.

This joint optimization allows coordinated learning between global switching decisions and local continuous control behaviors.

3.8. Algorithm Pseudocode

Algorithm 1 outlines the training procedure of the proposed CD-HSSRL (Cross-Domain Hierarchical Safe-Switching Reinforcement Learning), integrating global reachability planning, hierarchical medium-switching learning, and safety-constrained continuous control.

Algorithm 1: CD-HSSRL Training Procedure

4. Experiments

4.1. Datasets and Experimental Settings

To comprehensively evaluate the proposed CD-HSSRL framework for water–land cross-domain autonomous navigation and path planning, we conduct experiments on a suite of publicly available real-world datasets and a physics-based cross-domain amphibious simulation benchmark. This experimental design ensures that water-surface navigation, dynamic obstacle avoidance, land-based planning, and cross-domain transition behaviors are rigorously validated under reproducible conditions while enabling fair comparison with hierarchical planning and safety-constrained control baselines.

To improve the diversity and robustness of evaluation, all experiments are conducted under randomized initialization conditions, including randomized start/goal positions, obstacle layouts, and environmental disturbances. This randomized evaluation strategy helps reduce overfitting to specific environment configurations and provides a broader assessment of cross-domain navigation behavior.

WaterScenes Dataset: For water-surface environment perception and navigation evaluation, we adopt the WaterScenes dataset [31], which is a large-scale multimodal dataset containing synchronized radar and monocular camera data collected in real maritime environments. The dataset provides annotated water-surface scenes with moving vessels, shoreline structures, and free-space segmentation labels, enabling reliable construction of water-surface navigation states. In our experiments, WaterScenes is used to construct perception-driven water-domain navigation scenarios by converting semantic free-space and obstacle annotations into navigable occupancy and risk maps. Thus, the dataset serves as a realistic maritime perception benchmark for generating navigation states rather than providing direct control labels. It supports the evaluation of water-mode planning and collision avoidance performance under real visual sensing conditions. This dataset is primarily used to benchmark water-surface navigation baselines such as APF-DQN, I-DDPG, MORL, RLCA, and APF-D3QNPER.

Maritime Visual Tracking Dataset (MVTD): To assess dynamic obstacle avoidance in complex marine environments, we employ the Maritime Visual Tracking Dataset (MVTD) [32], which contains high-resolution video sequences of vessels under diverse sea states and lighting conditions. MVTD enables the construction of highly dynamic navigation scenarios with moving maritime targets by transforming visual tracking sequences into dynamic obstacle fields for decision-making evaluation. Therefore, MVTD is employed as a perception-driven dynamic navigation benchmark for assessing temporal decision-making and the safety performance of learning-based planners. This dataset is used to validate dynamic avoidance capabilities against baselines, including APF-D3QNPER, RLCA, CLPPO-GIC, and MORL-based methods.

BARN Ground Navigation Benchmark: To evaluate land-domain navigation and provide a standardized ground-planning baseline, we use the Benchmark for Autonomous Robot Navigation (BARN) [33], which consists of procedurally generated navigation environments with varying obstacle densities and complexity levels. BARN is employed to test land-mode planning and continuous control performance of CD-HSSRL and to compare against amphibious and multi-objective baselines such as IPPO, DDQN, HEA-PPO, and IMTCMO. In addition, hierarchical planning baselines such as pH-DRL and planning–learning integration methods such as MP-DQL are evaluated on BARN to benchmark long-horizon hierarchical decision-making and structured planning performance.

Cross-Domain Amphibious Benchmark Environment: Currently, no publicly available dataset contains real-world navigation data involving continuous water–land transition behaviors. To evaluate cross-domain switching and safety-constrained control under realistic physical constraints, we construct a physics-based cross-domain amphibious benchmark environment in Gazebo with water-surface and ground-contact plugins. The simulator models water depth variation, shoreline slope transitions, hydrodynamic drag, terrain friction, and obstacle interactions, thereby forming a reproducible benchmark for water–land transition evaluation. This benchmark environment is used to assess cross-domain reachability planning, medium-switching stability, and safety-constrained control performance of CD-HSSRL. Furthermore, safety-aware baselines such as BarrierNet are evaluated in this environment to compare safety-constrained continuous control and collision-avoidance performance, while pH-DRL and MP-DQL are also tested to benchmark hierarchical switching and planning–learning coupling in cross-domain tasks.

Task Protocol and Data Split: For each dataset and simulation environment, navigation tasks are generated by randomly sampling start and goal positions under domain-specific constraints. Each scenario is evaluated under 100 randomized navigation episodes. For reinforcement learning training, 80% of the generated episodes are used for training, 10% for validation, and 10% for testing. All baselines and the proposed method are trained and evaluated under identical environment settings to ensure fair comparison. All scenario generation scripts, environment configurations, and evaluation protocols will be released to ensure reproducibility.
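A minimal sketch of the 80/10/10 episode split is given below. The function name `split_episodes`, the fixed seed, and the total of 100 episodes are assumptions made for illustration. The actual number of generated episodes per scenario and the splitting code are not specified here.

```python
import random


def split_episodes(episode_ids, train=0.8, val=0.1, seed=0):
    """Shuffle episode IDs with a fixed seed and split them into train/val/test."""
    ids = list(episode_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Hypothetical pool of 100 generated episodes.
train_ids, val_ids, test_ids = split_episodes(range(100))
```

Fixing the seed keeps the split identical across methods, which is one simple way to give all baselines the same training and test episodes.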

To construct realistic and controllable navigation scenarios, the original datasets are transformed into simulation-compatible environments within the Gazebo + UUV framework. The conversion process consists of the following four stages:

(1) Data Preprocessing: Raw dataset inputs, including spatial layouts and environmental features, are first normalized and discretized into a structured representation. Irrelevant or noisy elements are filtered to ensure consistency with the simulation requirements.
(2) Environment Mapping: The processed data are mapped into a simulation environment by generating corresponding terrain structures, obstacle distributions, and domain labels (e.g., water, land, and transition regions). In particular, depth information and terrain elevation are used to define heterogeneous regions and cross-domain boundaries.
(3) Physical Parameter Assignment: To ensure realistic dynamics, physical parameters such as friction coefficients, drag forces, and buoyancy effects are assigned based on the mapped environment. These parameters are integrated into the UUV Simulator to reflect domain-specific behaviors.
(4) Scenario Generation: Multiple navigation scenarios are generated by varying start and goal positions, obstacle densities, and environmental conditions. This allows systematic evaluation under diverse and controlled settings.
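The environment-mapping stage can be sketched as a simple thresholding of a depth grid into domain labels. The threshold `shallow_limit_m`, the sign convention (positive depth means submerged), and the example grid are all assumptions for illustration. The actual mapping rules used in the conversion are not detailed in this section.

```python
WATER, TRANSITION, LAND = "water", "transition", "land"


def label_cell(depth_m, shallow_limit_m=0.3):
    """Assign a domain label from water depth in meters (positive = submerged)."""
    if depth_m <= 0.0:
        return LAND          # dry terrain at or above the waterline
    if depth_m < shallow_limit_m:
        return TRANSITION    # shallow shoreline band (assumed threshold)
    return WATER


def label_grid(depth_grid, shallow_limit_m=0.3):
    """Apply label_cell to every cell of a 2D depth grid."""
    return [[label_cell(d, shallow_limit_m) for d in row] for row in depth_grid]


# Hypothetical 2 x 4 depth grid running from open water (left) to land (right).
depth_grid = [
    [1.2, 0.8, 0.2, -0.1],
    [1.0, 0.4, 0.1, -0.3],
]
labels = label_grid(depth_grid)
```

A hard threshold like this is one instance of the abstraction that domain labels introduce, since a continuous shoreline profile is reduced to three discrete classes.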

Despite enabling flexible environment construction, the dataset-to-environment conversion process may introduce several sources of bias. First, discretization and simplification of raw data may lead to loss of fine-grained environmental details, potentially affecting the fidelity of terrain representation. Second, the mapping from dataset features to simulation parameters relies on predefined assumptions, which may not fully capture real-world variability. Third, the generated scenarios may exhibit distributional differences compared to real-world environments, particularly in terms of dynamic interactions and sensor noise characteristics. Finally, the use of domain labels introduces a level of abstraction that may oversimplify complex cross-domain transitions. To mitigate these issues, multiple scenarios with varying configurations are evaluated, and robustness experiments under disturbances are conducted to assess generalization performance.

Through the above experimental setup, the proposed CD-HSSRL framework is systematically evaluated on water-domain navigation, land-domain planning, dynamic obstacle avoidance, hierarchical decision-making, and cross-domain transition tasks, providing comprehensive validation of its effectiveness, safety, and generalization ability. Figure 6 illustrates the overall experimental process.

4.2. Implementation Details

All experiments are implemented in Python 3.13.5 using the PyTorch deep learning framework. The reinforcement learning components are built upon the OpenAI Gym interface and the Stable-Baselines3 library, while the amphibious simulation environment is developed in Gazebo with UUV-Simulator plugins. All experiments are conducted on a workstation equipped with an NVIDIA RTX 4090 GPU and an Intel Xeon CPU.

Network Architecture: For the high-level switching policy

π_{H}

, we adopt a multilayer perceptron with two hidden layers of 256 units, followed by a softmax output layer for option selection. For the low-level continuous control policy

π_{L}

, we use an actor–critic architecture with two fully connected hidden layers of 256 units. ReLU activation is applied in all hidden layers. The Q-networks in SAC and value networks in PPO share the same backbone structure for fair comparison across learning-based baselines. For hierarchical baselines such as pH-DRL, the high-level and low-level networks follow the original two-layer hierarchical architecture described in their implementation. For MP-DQL, motion primitive libraries are constructed according to the original setting, and DQN networks are implemented with the same backbone size as our planner network. For the safety-control baseline BarrierNet, the differentiable barrier layer is integrated on top of a continuous control policy network with identical hidden dimensions.

State and Action Representation: The state input

s_{t}

consists of local observation features, global waypoint guidance, and domain indicators (water, transition, and land). For WaterScenes and MVTD, visual observations are encoded using a lightweight convolutional encoder to extract semantic features. For BARN- and Gazebo-based environments, LiDAR-like occupancy grids and robot kinematic states are used as inputs. All learning-based baselines, including pH-DRL, MP-DQL, and BarrierNet, are adapted to use the same unified observation space and action definitions to ensure fair comparison. The action space includes continuous linear velocity and angular velocity commands.

Training Hyperparameters: The discount factor is set to

γ = 0.99

. For the PPO-based high-level switching policy, the clipping parameter is

ϵ = 0.2

, and the learning rate is

3 \times 10^{- 4}

. For the SAC-based low-level controller, the entropy coefficient

α

is automatically tuned, and the learning rate is

3 \times 10^{- 4}

. The switching regularization coefficient is set to

λ_{sw} = 0.05

, and the safety risk penalty coefficient is

κ = 1.0

. The replay buffer size is

1 \times 10^{6}

, and mini-batches of size 256 are sampled for each update. For pH-DRL and MP-DQL baselines, the original hyperparameters reported in their papers are adopted and then slightly tuned to match the unified environment scale. For BarrierNet, the barrier function penalty coefficient follows the default setting in the original implementation.

Training Protocol: All methods are trained for 2 million environment interaction steps. For each baseline and the proposed method, multiple randomized navigation scenarios are evaluated under different initialization conditions. Model checkpoints with the best validation performance are selected for final testing. To ensure fair comparison, all baselines are trained using the same observation space, action space, reward definitions, and environment settings.

Simulation Settings: In the Gazebo amphibious simulation, water drag coefficients, shoreline slope limits, and terrain friction parameters are calibrated according to standard USV and ground robot dynamic models. Collision detection and grounding events are monitored to compute safety-related evaluation metrics. The simulation runs at 20 Hz control frequency for all tested methods, including safety-constrained baselines such as BarrierNet.
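For reference, the settings reported in this subsection can be gathered into a single configuration object, as sketched below. The class name `TrainConfig` and all field names are hypothetical and do not correspond to the released code. Only the numerical values are taken from the text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    gamma: float = 0.99                    # discount factor
    ppo_clip_eps: float = 0.2              # PPO clipping parameter
    ppo_lr: float = 3e-4                   # high-level policy learning rate
    sac_lr: float = 3e-4                   # low-level controller learning rate
    sac_entropy_coef: str = "auto"         # alpha is tuned automatically
    lambda_sw: float = 0.05                # switching regularization coefficient
    kappa_risk: float = 1.0                # safety risk penalty coefficient
    replay_buffer_size: int = 1_000_000
    batch_size: int = 256
    hidden_units: tuple = (256, 256)       # two hidden layers of 256 units
    total_env_steps: int = 2_000_000
    control_rate_hz: float = 20.0

    @property
    def control_period_s(self) -> float:
        """Duration of one control step in seconds."""
        return 1.0 / self.control_rate_hz


config = TrainConfig()
settings = asdict(config)
```

Keeping the values in one immutable object makes it harder for the proposed method and the baselines to drift apart on shared settings such as the discount factor or batch size.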

Reproducibility: All datasets used in this study are publicly available. The simulation environment configuration files, training scripts, and evaluation protocols are publicly available at https://github.com/ls142968/CD-HSSRL.git (accessed on 28 April 2026). These implementation settings ensure stable training, fair baseline comparison, and reproducible evaluation for cross-domain amphibious navigation and path planning.

4.3. Baselines

To comprehensively evaluate the effectiveness of the proposed cross-domain navigation framework for autonomous tracked amphibious robotic systems, we compare our method with a set of representative and recent baselines covering amphibious cross-domain path planning, learning-based water-domain navigation, collision avoidance under rule-constrained navigation, multi-objective decision-making, and safety-aware hierarchical control. All selected baselines are derived from published studies with explicitly named methodologies and established experimental protocols. This comparison set ensures a fair and comprehensive validation of global planning capabilities, cross-medium adaptability, dynamic obstacle avoidance, hierarchical decision-making, and safety-constrained control. All baseline methods are implemented within the same Gazebo + UUV simulation framework and adapted to a unified observation and action space. Only interface-level modifications are introduced to ensure compatibility with the proposed simulation environment, while the original algorithmic structures of all baselines are preserved.

Unified Observation Space: All methods receive identical state observations, including (1) vehicle position and orientation, (2) linear and angular velocities, (3) obstacle distance information extracted from the local cost map, (4) terrain-related features such as local depth and transition-region indicators, and (5) local environmental disturbance information. The observation dimensions are kept consistent across all methods to avoid performance differences caused by unequal environmental information.

Unified Action Space: All methods output continuous control commands consisting of forward velocity, steering/angular control, and thrust-related actuation signals. Action ranges are normalized to the same control bounds across all methods to ensure equivalent actuation capability.

Unified Reward Structure: To minimize bias introduced by reward engineering, a shared reward structure is adopted whenever possible. The reward function includes the following: goal-reaching reward, collision penalty, transition smoothness regularization, and energy-consumption penalty. Only minimal modifications required for algorithmic compatibility are introduced.
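The four reward terms listed above can be combined as in the following sketch. The weights in `RewardWeights` are placeholder values, and modeling the smoothness term as a penalty on the change between consecutive actions is an assumption made here. Neither the actual coefficients nor the exact form of each term is given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Placeholder weights (assumed values, not taken from the paper)."""
    goal: float = 10.0
    collision: float = 10.0
    smoothness: float = 0.1
    energy: float = 0.01


def step_reward(reached_goal, collided, prev_action, action, w=RewardWeights()):
    """Shared per-step reward: goal bonus minus collision, smoothness and energy costs."""
    # Assumed smoothness term: squared change between consecutive actions.
    smooth_cost = sum((a - b) ** 2 for a, b in zip(action, prev_action))
    # Energy term: squared action norm, matching the form of the EC metric.
    energy_cost = sum(a ** 2 for a in action)
    reward = 0.0
    if reached_goal:
        reward += w.goal
    if collided:
        reward -= w.collision
    reward -= w.smoothness * smooth_cost
    reward -= w.energy * energy_cost
    return reward


# Hypothetical step: the robot accelerates from rest and hits an obstacle.
r = step_reward(False, True, prev_action=(0.0, 0.0), action=(1.0, 0.0))
```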

Unified Training and Evaluation Settings: All methods are trained and evaluated under identical simulation conditions, including the same environment layouts, obstacle configurations, cross-domain transition regions, disturbance settings, training episodes, and random-seed initialization strategy. No additional privileged information is provided to the proposed method.

Cross-Domain Amphibious Path Planning Baselines: IPPO [34] proposes an Improved Proximal Policy Optimization framework for global path planning of amphibious robots. It enhances PPO by integrating attention and recurrent modules to address discontinuous dynamics during medium switching, making it a representative baseline for cross-domain reinforcement learning-based navigation. For fair comparisons, IPPO is adapted to the unified continuous-control observation and action space while preserving its original policy architecture.

DDQN [35] introduces a global path planning algorithm based on Double Deep Q-Networks for multi-task amphibious robotic platforms. This work represents one of the early reinforcement learning solutions for amphibious navigation, serving as a fundamental value-based baseline for cross-medium global planning. Since DDQN originally employs discrete action selection, its action interface is discretized from the unified continuous control space while maintaining identical environmental observations.
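One plausible way to discretize a continuous control space for a value-based method such as DDQN is a uniform grid over the command bounds, sketched below. The velocity bounds, the grid resolution, and the restriction to two commands (linear and angular velocity) are assumptions for illustration. The discretization actually used for the baseline is not specified here.

```python
from itertools import product


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def build_discrete_actions(v_bounds, w_bounds, n_v, n_w):
    """Enumerate (linear velocity, angular velocity) pairs on a uniform grid."""
    return list(product(linspace(*v_bounds, n_v), linspace(*w_bounds, n_w)))


def nearest_action_index(actions, v, w):
    """Index of the discrete action closest to a continuous command (v, w)."""
    return min(range(len(actions)),
               key=lambda i: (actions[i][0] - v) ** 2 + (actions[i][1] - w) ** 2)


# Hypothetical normalized bounds: v in [0, 1], w in [-1, 1].
actions = build_discrete_actions((0.0, 1.0), (-1.0, 1.0), n_v=3, n_w=5)
idx = nearest_action_index(actions, 0.6, 0.1)
```

The Q-network then outputs one value per grid entry, and the selected index is mapped back to a continuous command before being sent to the simulator.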

HEA-PPO [36] combines a hyper-heuristic evolutionary algorithm with PPO to achieve energy-constrained collaborative path planning for heterogeneous amphibious robotic systems. It provides a hybrid evolutionary–learning strategy to handle multi-robot coordination and complex environmental constraints. The method is adapted using the same state observations and reward structure adopted in our framework.

IMTCMO [37] proposes an improved multitasking-constrained multi-objective optimization framework for multi-amphibious robotic collaboration in constrained environments. Unlike end-to-end learning approaches, IMTCMO focuses on constrained multi-objective optimization, providing a strong non-learning baseline for cross-domain path planning under multiple conflicting objectives. The planner receives the same terrain and obstacle information as all learning-based methods.

Learning-Based Water-Domain Navigation Baselines: APF-DQN [38] presents a hybrid artificial potential field–DQN framework enhanced with ocean current prediction for water-surface robotic navigation in dynamic environments. By integrating physical prior guidance with deep Q-learning, it serves as a representative baseline for physics-guided learning in water-domain navigation. The action space is discretized consistently with the DDQN baseline.

I-DDPG [39] proposes an improved deep deterministic policy gradient algorithm for continuous-action water-domain navigation, targeting control smoothness and reward shaping for dynamic environments. This method acts as a typical actor–critic continuous-action baseline for comparing control stability and convergence behavior. The original continuous-action architecture is preserved without structural modification.

The MORL-based method [40] designs a multi-objective reinforcement learning architecture for water-domain robotic navigation, employing ensemble decision mechanisms to balance safety, efficiency, and energy consumption. It provides a canonical baseline for multi-objective decision-making in learning-based navigation. The reward weights are normalized to align with the unified evaluation objectives.

Safety-Aware Collision Avoidance and Dynamic Decision Baselines: RLCA [41] introduces a reinforcement learning collision avoidance algorithm by explicitly incorporating maneuvering characteristics and rule-constrained navigation principles into the learning framework. This method forms a representative safety-aware baseline for rule-constrained collision avoidance in autonomous robotic navigation. The same obstacle and transition-region information are provided as environmental inputs.

APF-D3QNPER [42] proposes a hybrid deep learning architecture combining artificial potential fields, dueling double DQN, prioritized experience replay, and LSTM for navigation in unknown dynamic environments. It provides a strong baseline for dynamic obstacle avoidance with temporal memory and guided exploration. Its observation interface is standardized to the same local environmental representation used in our framework.

CLPPO-GIC [43] develops a CNN–LSTM–PPO framework with a generalized integral compensator mechanism for multi-agent autonomous collision avoidance. By integrating temporal feature extraction and state-error compensation into PPO, it serves as a representative baseline for sequential decision-making and dynamic interaction scenarios. The network structure remains unchanged while adopting the same simulation settings and action constraints.

Hierarchical Planning and Safety-Constrained Control Baselines: BarrierNet [28] proposes differentiable control barrier functions for learning safe robot control. By embedding a safety-filtering layer into policy optimization, it serves as a representative baseline for safety-constrained continuous control and directly corresponds to the safety projection mechanism in our controller. The same safety constraints and control bounds are applied during evaluation.

pH-DRL [26] introduces a predictive hierarchical reinforcement learning framework for long-horizon navigation, where a high-level planner guides low-level controllers through predictive sub-goal generation. This method serves as a representative hierarchical decision-making baseline comparable to our Hierarchical Safe Switching Policy. The hierarchical interfaces are preserved while adapting the observation inputs to the unified environment representation.

MP-DQL [27] formulates motion primitives as the action space of deep Q-learning for autonomous driving planning. By integrating structured global planning with deep learning-based decision-making, it provides a strong baseline for comparing cross-domain global reachability planning and planning–learning joint optimization. Motion primitives are re-parameterized according to the amphibious vehicle dynamics while maintaining the original planning logic.

Overall, these baselines collectively cover cross-domain amphibious navigation, learning-based water-domain navigation, rule-constrained collision avoidance, multi-objective optimization, hierarchical decision-making, and safety-constrained control. By standardizing observation space, action space, reward structure, and environmental conditions, the proposed CD-HSSRL framework is evaluated under a consistent and reproducible experimental protocol, ensuring that performance differences primarily arise from algorithmic characteristics rather than inconsistent implementation settings.

4.4. Evaluation Metrics

To comprehensively evaluate the effectiveness of the proposed CD-HSSRL framework in cross-domain autonomous navigation and path planning, we adopt a set of quantitative metrics covering navigation success, safety performance, efficiency, and switching stability. All metrics are computed consistently for the proposed method and all baselines under identical experimental settings.

Success Rate (SR): The success rate measures the proportion of navigation trials in which the robot successfully reaches the target without collision or grounding:

SR = \frac{N_{success}}{N_{total}} .

(27)

Collision Rate (CR): The collision rate evaluates safety performance by measuring the frequency of collision or grounding events:

CR = \frac{N_{collision}}{N_{total}} .

(28)

Safety Violation Rate (SVR): To further assess safety-constrained control performance, we measure the frequency of safety constraint violations:

SVR = \frac{N_{violation}}{N_{total}},

(29)

where

N_{violation}

denotes episodes where safety constraints (collision, grounding, or forbidden-zone entry) are violated. This metric is particularly used to compare safety-aware baselines such as BarrierNet.

Average Path Length (APL): APL measures navigation efficiency by computing the average traveled path length:

APL = \frac{1}{N_{success}} \sum_{i = 1}^{N_{success}} L_{i} .

(30)

Average Navigation Time (ANT): ANT evaluates decision-making and planning efficiency by measuring the average time steps required to reach the target:

ANT = \frac{1}{N_{success}} \sum_{i = 1}^{N_{success}} T_{i},

(31)

where

T_{i}

denotes the completion time steps of episode i. This metric is mainly used to compare hierarchical planning and planning–learning baselines such as pH-DRL and MP-DQL.

Energy Consumption (EC): Energy consumption evaluates control efficiency by accumulating actuation energy along trajectories:

EC = \frac{1}{N_{success}} \sum_{i = 1}^{N_{success}} \sum_{t = 1}^{T_{i}} \| a_{t} \|^{2} .

(32)

Switching Stability Index (SSI): To quantify medium-switching stability across water–land transitions, we define a Switching Stability Index:

SSI = 1 - \frac{N_{switch}}{T_{total}} .

(33)

Cross-Domain Transition Success Rate (CTS): CTS evaluates the success probability of completing water–land or land–water transitions without failure:

CTS = \frac{N_{\text{transition-success}}}{N_{\text{transition-attempt}}} .

(34)

These metrics jointly evaluate global reachability, local safety, control efficiency, hierarchical decision-making performance, and cross-domain switching capability, providing a comprehensive assessment of the proposed CD-HSSRL framework against all baselines.
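The metrics in Equations (27) to (34) can be computed from per-episode logs as sketched below. The `Episode` record and its field names are hypothetical. The sketch also assumes that N_switch and T_total in Equation (33) are the total number of mode switches and the total number of time steps summed over all episodes, which is not stated explicitly in the text. The four example episodes are made-up values for illustration, not experimental data.

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical per-episode log record."""
    success: bool
    collision: bool
    violation: bool            # collision, grounding, or forbidden-zone entry
    path_length: float
    steps: int
    actions: list              # one action vector per time step
    mode_switches: int = 0
    transition_attempts: int = 0
    transition_successes: int = 0


def _ratio(num, den):
    """Safe division that returns NaN when the denominator is zero."""
    return num / den if den else math.nan


def evaluate(episodes):
    """Compute SR, CR, SVR, APL, ANT, EC, SSI and CTS over a list of episodes."""
    n_total = len(episodes)
    wins = [e for e in episodes if e.success]
    # Actuation energy per successful episode: sum of squared action norms.
    energy = [sum(sum(u * u for u in a) for a in e.actions) for e in wins]
    switches = sum(e.mode_switches for e in episodes)
    total_steps = sum(e.steps for e in episodes)
    return {
        "SR": _ratio(len(wins), n_total),
        "CR": _ratio(sum(e.collision for e in episodes), n_total),
        "SVR": _ratio(sum(e.violation for e in episodes), n_total),
        "APL": _ratio(sum(e.path_length for e in wins), len(wins)),
        "ANT": _ratio(sum(e.steps for e in wins), len(wins)),
        "EC": _ratio(sum(energy), len(wins)),
        "SSI": 1.0 - _ratio(switches, total_steps),
        "CTS": _ratio(sum(e.transition_successes for e in episodes),
                      sum(e.transition_attempts for e in episodes)),
    }


# Four made-up episodes (illustrative values only).
episodes = [
    Episode(True, False, False, 10.0, 4, [(1.0, 0.0)] * 4, 1, 1, 1),
    Episode(True, False, False, 14.0, 6, [(0.5, 0.5)] * 6, 2, 2, 2),
    Episode(False, True, True, 5.0, 5, [(1.0, 0.0)] * 5, 1, 1, 0),
    Episode(False, False, True, 6.0, 5, [(1.0, 0.0)] * 5, 0, 0, 0),
]
metrics = evaluate(episodes)
```

Note that APL, ANT, and EC are averaged over successful episodes only, so they should be read together with SR: a method that succeeds only on easy episodes can report short paths and low energy.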

5. Results and Discussion

5.1. Overall Comparison with Representative Baselines

We first conduct a comprehensive comparison between the proposed CD-HSSRL framework and representative baselines on water-domain navigation, land-domain navigation, and cross-domain transition tasks. The evaluated baselines include IPPO, DDQN, HEA-PPO, IMTCMO, APF-DQN, I-DDPG, MORL-based, RLCA, APF-D3QNPER, and CLPPO-GIC, as well as three additional baselines: BarrierNet, pH-DRL, and MP-DQL. All methods are trained and tested under identical observation spaces, action spaces, reward functions, and environment settings to ensure fair comparison. For baselines that are originally defined with structured action spaces (e.g., MP-DQL) or safety-filtering layers (e.g., BarrierNet), we follow their original protocol while aligning the state representation and evaluation interface to our unified cross-domain navigation setting.

Overall Quantitative Results: Table 1 reports the overall performance on the WaterScenes, MVTD, BARN, and Gazebo cross-domain environments. For WaterScenes, MVTD, and BARN, we report the success rate (SR), collision rate (CR), average path length (APL), and energy consumption (EC). For the Gazebo cross-domain environment, we report the SR, CR, Switching Stability Index (SSI), and Cross-Domain Transition Success Rate (CTS), which directly measure medium-switching stability and transition robustness. The best results are highlighted in bold.

Visualization of SOTA Comparison: To provide an intuitive comparison, Figure 7 visualizes the SR and CR performance across different datasets. CD-HSSRL consistently achieves higher success rates and lower collision rates compared with all baselines, particularly in the Gazebo cross-domain environment, demonstrating its effective cross-medium decision-making and safety control capability.

Result Analysis: From Table 1 and Figure 7, several observations can be made.

First, on WaterScenes and MVTD, CD-HSSRL achieves favorable performance compared with USV-oriented baselines such as APF-DQN, I-DDPG, and RLCA, indicating that the proposed Safety-Constrained Continuous Controller effectively improves dynamic obstacle avoidance under complex maritime conditions. Moreover, compared with BarrierNet, CD-HSSRL achieves higher SR with comparable or lower CR, suggesting that jointly optimizing hierarchical switching with safety-aware control yields additional benefits beyond purely safety-filtered control.

Second, on the BARN benchmark, CD-HSSRL achieves comparable or better performance than land-navigation and hierarchical planning baselines such as IPPO, HEA-PPO, and pH-DRL, demonstrating that the low-level controller maintains stable control performance and the high-level policy supports effective long-horizon decision-making even without water-domain dynamics.

Third, in the Gazebo cross-domain environment, CD-HSSRL shows a consistently higher Cross-Domain Transition Success Rate (CTS) and Switching Stability Index (SSI) than amphibious baselines such as IPPO, DDQN, HEA-PPO, and IMTCMO, as well as the hierarchical and planning baselines pH-DRL and MP-DQL. This verifies that the Hierarchical Safe Switching Policy and unified cross-domain reachability planner effectively handle discontinuous water–land dynamics. In addition, CD-HSSRL achieves the lowest CR among all compared methods, indicating that the safety-constrained controller helps prevent grounding and collisions during shoreline interaction.

Overall, these results confirm that CD-HSSRL achieves competitive performance across water-domain navigation, land-domain planning, and cross-domain transition tasks, validating the effectiveness of the proposed CD-HSSRL framework for autonomous amphibious robot navigation and path planning.

5.2. Cross-Domain Transition Performance

Since the primary contribution of CD-HSSRL lies in handling discontinuous water–land dynamics, we further conduct dedicated experiments to evaluate cross-domain transition performance in the Gazebo-based amphibious simulation environment. Three representative transition tasks are designed: (1) water to land (shoreline climbing), (2) land to water (water entry), and (3) multiple transitions (water–land–water). These tasks explicitly test global reachability planning, medium-switching stability, and safety-constrained control under realistic cross-domain physical interactions.

Baselines for Cross-Domain Evaluation: To ensure a fair and mechanism-consistent comparison, we select four representative baselines for cross-domain transition evaluation: IPPO as a reinforcement learning-based amphibious navigation method, HEA-PPO as an optimization-driven energy-constrained amphibious planner, RLCA as a rule-constrained safety-aware collision avoidance strategy, and BarrierNet as a differentiable safety-constrained control framework. These baselines correspond, respectively, to cross-domain policy learning, multi-objective optimization, rule-constrained safety control, and optimization-based safety filtering, thus providing complementary perspectives for evaluating hierarchical switching and safety-constrained control in CD-HSSRL.

Quantitative Results: Table 2 summarizes cross-domain transition performance in terms of the Cross-Domain Transition Success Rate (CTS), Switching Stability Index (SSI), collision rate (CR), Safety Violation Rate (SVR), and energy consumption (EC).

Trajectory Visualization: To qualitatively illustrate cross-domain navigation behaviors, Figure 8 shows representative trajectories of CD-HSSRL, IPPO, and HEA-PPO in the water-to-land task. IPPO and HEA-PPO often experience unstable mode switching or partial grounding near the shoreline because they lack explicit switching stability constraints. BarrierNet achieves safe but conservative shoreline behaviors with slower progress. In contrast, CD-HSSRL generates smooth transition trajectories and reaches land targets without oscillatory control. Although some trajectories are longer, the proposed method generates smoother and safer transitions with reduced collision risk near domain boundaries. Figure 9 shows the position of the robot at different times from the starting point to the ending point.

Switching Sequence Analysis: To further examine switching stability, Figure 10 visualizes the temporal evolution of motion modes during cross-domain navigation. For clarity of temporal illustration, IPPO is selected as the representative reinforcement learning baseline, and BarrierNet is selected as the representative safety-filtering baseline. CD-HSSRL exhibits consistent and minimal mode switches, whereas IPPO shows frequent oscillations between water and transition modes, and BarrierNet tends to delay switching decisions due to conservative safety constraints, leading to reduced transition efficiency.

Result Analysis: From Table 2, CD-HSSRL achieves the highest CTS and SSI among all compared methods, indicating effective cross-domain transition robustness and stable medium-switching decisions. In particular, CD-HSSRL improves CTS by 7–15% over representative baselines IPPO, HEA-PPO, RLCA, and BarrierNet, demonstrating the effectiveness of the Hierarchical Safe Switching Policy. Moreover, the lowest CR and SVR confirm that the Safety-Constrained Continuous Controller successfully prevents grounding and collision events during shoreline interaction. Although BarrierNet maintains strong safety performance through explicit constraint enforcement, it exhibits higher energy consumption and slower transitions due to conservative action filtering.

Overall, these results verify that the proposed CD-HSSRL framework effectively addresses discontinuous cross-domain dynamics and achieves competitive performance in amphibious water–land transition tasks.

5.3. Ablation Studies

To investigate the contribution of each key component in CD-HSSRL, we conduct ablation experiments by selectively removing major modules from the proposed framework. All ablation variants are evaluated under the same Gazebo cross-domain transition tasks and MVTD dynamic obstacle scenarios, since these environments best reflect the core challenges of cross-domain switching and safety-aware control.

Ablation Settings: We design five representative ablation variants:

A1: w/o CD-GRP, which removes the Cross-Domain Global Reachability Planner and replaces it with a local greedy planner.
A2: w/o HSSP, which removes the Hierarchical Safe Switching Policy and uses a single flat policy.
A3: w/o Safety Projection, which removes the safety-constrained action projection layer.
A4: w/o Risk-Sensitive Reward, which removes the risk penalty term in reward shaping.
A5: w/o Switching Regularization, which removes the switching stability loss $L_{sw}$.
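
The five variants can be viewed as the full configuration with exactly one module switched off. A minimal sketch of that bookkeeping follows; the class name, field names, and helper are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict, List


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical module switches; all True corresponds to the full model."""
    use_cd_grp: bool = True             # False: local greedy planner (A1)
    use_hssp: bool = True               # False: single flat policy (A2)
    use_safety_projection: bool = True  # False: no action projection (A3)
    use_risk_reward: bool = True        # False: no risk penalty term (A4)
    use_switching_reg: bool = True      # False: no switching loss (A5)


FULL = AblationConfig()
VARIANTS: Dict[str, AblationConfig] = {
    "Full": FULL,
    "A1": replace(FULL, use_cd_grp=False),
    "A2": replace(FULL, use_hssp=False),
    "A3": replace(FULL, use_safety_projection=False),
    "A4": replace(FULL, use_risk_reward=False),
    "A5": replace(FULL, use_switching_reg=False),
}


def disabled_modules(cfg: AblationConfig) -> List[str]:
    """List the switches that are turned off in a configuration."""
    return [f.name for f in fields(cfg) if not getattr(cfg, f.name)]


for name, cfg in VARIANTS.items():
    print(name, disabled_modules(cfg))
```

Deriving every variant from one frozen baseline keeps each ablation a single-factor change, which is what makes the rows of an ablation table comparable.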

Quantitative Results: Table 3 reports the ablation results in terms of the Cross-Domain Transition Success Rate (CTS), Switching Stability Index (SSI), collision rate (CR), and energy consumption (EC).

Visualization of Ablation Impact: Figure 11 visualizes the impact of removing each module on CTS and CR. Removing HSSP and the switching regularization term causes significant degradation in SSI and CTS, while removing the safety projection layer leads to a sharp increase in collision rate. These observations highlight the necessity of hierarchical switching and explicit safety enforcement in cross-domain navigation.

Result Analysis: From Table 3, removing the Cross-Domain Global Reachability Planner (A1) reduces CTS by 8 percentage points (0.86 to 0.78), indicating that unified cross-domain cost-aware planning is important for successful shoreline transitions. Removing the Hierarchical Safe Switching Policy (A2) results in unstable mode decisions and a drop in SSI from 0.90 to 0.70, demonstrating the importance of structured option-based switching for discontinuous water–land dynamics. The absence of the Safety Projection layer (A3) raises CR from 0.08 to 0.22, the largest increase among all variants, confirming that explicit constraint enforcement is critical for preventing grounding and collisions, although this variant also records the lowest EC (30.9). Removing the risk-sensitive reward (A4) leads to moderate but consistent degradation across all metrics, whereas removing the switching regularization (A5) mainly harms switching stability, lowering SSI to 0.69, the lowest value in the table. Together, these results show that both safety-oriented reward shaping and the switching stability loss contribute to robust and efficient navigation.
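
The per-variant effects can be read off more easily as differences from the full model. The snippet below computes them from the values in Table 3; the `Row` type and the `delta` helper are names introduced here for illustration.

```python
from typing import Dict, NamedTuple


class Row(NamedTuple):
    cts: float
    ssi: float
    cr: float
    ec: float


# Values copied from Table 3.
TABLE3: Dict[str, Row] = {
    "Full": Row(0.86, 0.90, 0.08, 31.6),
    "A1": Row(0.78, 0.84, 0.12, 35.9),
    "A2": Row(0.73, 0.70, 0.15, 34.8),
    "A3": Row(0.69, 0.72, 0.22, 30.9),
    "A4": Row(0.76, 0.82, 0.14, 33.7),
    "A5": Row(0.74, 0.69, 0.16, 32.8),
}


def delta(variant: Row, full: Row) -> Row:
    """Metric-wise difference (variant minus full), rounded to two decimals."""
    return Row(*(round(v - f, 2) for v, f in zip(variant, full)))


DELTAS = {name: delta(row, TABLE3["Full"])
          for name, row in TABLE3.items() if name != "Full"}
largest_cr_increase = max(DELTAS, key=lambda name: DELTAS[name].cr)
largest_ssi_drop = min(DELTAS, key=lambda name: DELTAS[name].ssi)

for name, d in DELTAS.items():
    print(name, d)
print("largest CR increase:", largest_cr_increase)
print("largest SSI drop:", largest_ssi_drop)
```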

Overall, the ablation results verify that each proposed module plays a complementary and indispensable role in achieving robust cross-domain autonomous navigation.

5.4. Robustness Analysis

In real-world amphibious navigation, environmental disturbances, perception uncertainty, and scene complexity may significantly affect policy stability and safety. To evaluate the robustness of CD-HSSRL under such uncertainties, we conduct robustness experiments from three perspectives: (1) hydrodynamic disturbance intensity, (2) perception noise, and (3) obstacle density variation. All experiments are performed in the Gazebo cross-domain simulation and MVTD dynamic navigation environments.

For fair and representative comparison, IPPO and HEA-PPO are selected as amphibious navigation baselines, RLCA represents rule-based maritime safety control, BarrierNet represents optimization-based safety-constrained control, and pH-DRL represents hierarchical long-horizon decision-making. Together, these baselines cover reinforcement learning-based cross-domain navigation, optimization-driven planning, rule-constrained safety control, safety-filtering control, and hierarchical planning, providing a broad basis for evaluating the robustness of CD-HSSRL.

R1: Hydrodynamic Disturbance: We vary the water current velocity in the Gazebo environment from 0 to 1.5 m/s to simulate conditions ranging from calm to strong flow. Table 4 reports the success rate (SR) and collision rate (CR) under different current intensities.

R2: Perception Noise: To simulate sensor uncertainty, Gaussian noise with increasing standard deviation (0.0 to 0.3) is added to observation features extracted from WaterScenes and MVTD. Table 5 presents the Cross-Domain Transition Success Rate (CTS) under different noise levels.
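
The noise injection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the toy feature vector, and the choice of zero-mean noise applied independently to each feature are assumptions made here, and the noise levels are the values listed in Table 5.

```python
import random
from typing import List, Sequence


def add_gaussian_noise(features: Sequence[float], std: float,
                       rng: random.Random) -> List[float]:
    """Add independent zero-mean Gaussian noise to every feature."""
    if std < 0:
        raise ValueError("std must be non-negative")
    return [x + rng.gauss(0.0, std) for x in features]


# Hypothetical observation features (not taken from any dataset).
observation = [0.25, -1.0, 3.5, 0.0]
NOISE_LEVELS = (0.0, 0.1, 0.2, 0.3)  # noise standard deviations from Table 5

for std in NOISE_LEVELS:
    rng = random.Random(42)  # fixed seed so each level is reproducible
    print(std, add_gaussian_noise(observation, std, rng))
```

Seeding a dedicated `random.Random` instance for each level keeps the perturbation reproducible without touching the global random state.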

R3: Obstacle Density: We further increase the number of dynamic obstacles in MVTD and Gazebo environments to evaluate navigation robustness under crowded scenes. For long-horizon planning robustness comparison, pH-DRL is included as a representative hierarchical decision-making baseline. Figure 12 illustrates SR degradation trends as obstacle density increases.

Result Analysis: From Table 4 and Table 5, CD-HSSRL maintains the highest SR and CTS among the compared methods at every disturbance level. Its CR in Table 4 is lower than that of IPPO and HEA-PPO, equal to that of RLCA, and 0.01 higher than that of BarrierNet at each current velocity. BarrierNet thus achieves the lowest collision rates through conservative safety filtering, but its SR falls from 0.83 to 0.72 as the current increases from 0 to 1.5 m/s, compared with 0.87 to 0.78 for CD-HSSRL, and its CTS remains 0.07 below that of CD-HSSRL at every noise level, indicating more limited adaptability to dynamic cross-domain disturbances. Meanwhile, pH-DRL shows more stable long-horizon planning under increased obstacle density, but it still suffers from switching oscillations during water–land transitions.

We further analyze representative failure cases observed during experiments. Failures typically occur under (1) strong dynamic disturbances near transition regions, (2) ambiguous domain boundaries, and (3) high levels of sensor noise. These cases reveal that the switching mechanism may become unstable under rapidly changing conditions, leading to suboptimal decisions.

Overall, CD-HSSRL demonstrates effective robustness against hydrodynamic disturbances, perception uncertainty, and scene complexity, confirming that hierarchical safe switching and safety-constrained continuous control jointly contribute to stable and robust cross-domain navigation.

5.5. Parameter Sensitivity Analysis

The proposed CD-HSSRL framework introduces several key hyperparameters that control cross-domain switching stability, safety-constrained optimization, and terrain-aware navigation behavior. To verify that the performance improvements are not overly dependent on specific parameter settings, we conduct a sensitivity analysis on four representative parameters: (1) the switching regularization coefficient $\lambda_{sw}$, (2) the safety projection penalty coefficient $\lambda_{safe}$, (3) the hierarchical option termination threshold $\kappa$, and (4) the cost-map weighting coefficient $\lambda_{cost}$.

All experiments are conducted in the Gazebo + UUV cross-domain simulation environment using both water-to-land and multi-transition navigation tasks.

P1: Switching Regularization Coefficient $\lambda_{sw}$: The coefficient $\lambda_{sw}$ controls the strength of the switching stability loss introduced in the Hierarchical Safe Switching Policy. We vary $\lambda_{sw}$ from 0 to 1.0 and report CTS and SSI in Table 6.
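
To make the role of the coefficient concrete, the sketch below adds a weighted switching penalty to a task loss and shows how the weight changes which of two candidate behaviors scores better. The penalty form (fraction of steps at which the selected option changes), the loss values, and all names are assumptions made for illustration; this section does not restate the actual definition of $L_{sw}$.

```python
from typing import Sequence


def switching_penalty(options: Sequence[int]) -> float:
    """Assumed penalty: fraction of steps where the selected option changes."""
    if len(options) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(options, options[1:]) if prev != cur)
    return changes / (len(options) - 1)


def regularized_loss(task_loss: float, options: Sequence[int],
                     lambda_sw: float) -> float:
    """Task loss plus the weighted switching penalty."""
    return task_loss + lambda_sw * switching_penalty(options)


# Hypothetical candidates: (task loss, option index per step).
STABLE = (1.00, [0, 0, 0, 1, 1, 2, 2, 2])
OSCILLATING = (0.90, [0, 1, 0, 1, 0, 1, 2, 2])


def preferred(lambda_sw: float) -> str:
    """Name the candidate with the lower regularized loss."""
    stable = regularized_loss(*STABLE, lambda_sw)
    oscillating = regularized_loss(*OSCILLATING, lambda_sw)
    return "stable" if stable < oscillating else "oscillating"


for lam in (0.0, 0.2, 0.5, 0.8, 1.0):  # values swept in Table 6
    print(lam, preferred(lam))
```

In this toy setting, a zero weight favors the oscillating candidate because only the task loss counts, while larger weights favor the stable one.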

P2: Safety Projection Penalty $\lambda_{safe}$: The parameter $\lambda_{safe}$ weights the constraint violation penalty in the Safety-Constrained Continuous Controller. We vary $\lambda_{safe}$ from 0.1 to 2.0 and report the collision rate (CR) and energy consumption (EC) in Table 7.
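
A weighted constraint-violation penalty of this kind can be sketched as below. The hinge-style violation on obstacle distance, the numeric values, and the function names are assumptions chosen for illustration and are not the controller's actual formulation.

```python
def constraint_violation(distance: float, safe_distance: float) -> float:
    """Assumed hinge violation: zero when at least safe_distance away."""
    return max(0.0, safe_distance - distance)


def penalized_reward(task_reward: float, distance: float,
                     safe_distance: float, lambda_safe: float) -> float:
    """Task reward minus the weighted constraint violation."""
    return task_reward - lambda_safe * constraint_violation(distance, safe_distance)


# Hypothetical situation: the robot is 0.5 m from an obstacle, limit is 1.0 m.
for lam in (0.1, 0.5, 1.0, 1.5, 2.0):  # values swept in Table 7
    print(lam, penalized_reward(1.0, distance=0.5, safe_distance=1.0,
                                lambda_safe=lam))
```

With a small weight the same violation barely changes the reward, while a large weight dominates it, which mirrors the trade-off between collision risk and conservative behavior.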

P3: Option Termination Threshold $\kappa$: The threshold $\kappa$ determines when the high-level policy terminates a motion option and triggers medium switching. We vary $\kappa$ from 0.3 to 0.9 and evaluate the Cross-Domain Transition Success Rate (CTS). Figure 13 visualizes the CTS variation trend.
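
One way to picture the effect of the threshold is to assume that the high-level policy outputs a termination score at each step and ends the current option once that score reaches $\kappa$. The comparison rule, the score sequence, and the function name below are assumptions for illustration only.

```python
from typing import Optional, Sequence


def first_termination_step(scores: Sequence[float],
                           kappa: float) -> Optional[int]:
    """Return the first step whose termination score reaches kappa, if any."""
    for step, score in enumerate(scores):
        if score >= kappa:
            return step
    return None


# Hypothetical termination scores while approaching a shoreline.
scores = [0.10, 0.20, 0.35, 0.50, 0.70, 0.85, 0.95]

for kappa in (0.3, 0.5, 0.7, 0.9):  # range swept in the sensitivity study
    print(kappa, first_termination_step(scores, kappa))
```

Under this assumption, a low threshold triggers switching early and a high threshold delays it, so the threshold trades responsiveness against premature switching.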

P4: Cost-Map Weighting Coefficient $\lambda_{cost}$: The weighting coefficient $\lambda_{cost}$ controls the influence of terrain-aware traversal costs in the global cost-map representation. Larger values encourage the agent to avoid risky transition regions and obstacle-dense areas, while smaller values prioritize shorter trajectories with weaker terrain awareness. We vary $\lambda_{cost}$ from 0.1 to 2.0 and evaluate the success rate (SR), collision rate (CR), and average path length (PL). The results are summarized in Table 8.
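
The trade-off between path length and terrain risk can be reproduced on a toy grid. The sketch below runs Dijkstra's algorithm with an assumed per-step cost of `1 + lambda_cost * risk`; the grid, the cost form, and the function name are illustrative assumptions and not the CD-GRP cost map itself.

```python
import heapq
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def plan(risk: List[List[float]], start: Cell, goal: Cell,
         lambda_cost: float) -> List[Cell]:
    """Dijkstra on a 4-connected grid; entering a cell costs 1 + lambda * risk.

    Assumes the goal is reachable (the toy grid has no blocked cells).
    """
    rows, cols = len(risk), len(risk[0])
    best: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier: List[Tuple[float, Cell]] = [(0.0, start)]
    while frontier:
        cost, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + 1.0 + lambda_cost * risk[nr][nc]
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    parent[nxt] = cell
                    heapq.heappush(frontier, (new_cost, nxt))
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Hypothetical risk map: row 1 crosses a risky strip, row 0 is a safe detour.
RISK = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
]

for lam in (0.1, 0.5, 2.0):
    route = plan(RISK, (1, 0), (1, 4), lam)
    print(lam, len(route) - 1, route)
```

On this grid the direct route costs `4 + 3 * lambda_cost` and the detour costs 6, so the planner switches to the longer, safer route once the weight exceeds 2/3.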

Result Analysis: From Table 6 and Table 7, CD-HSSRL achieves the best balance between switching stability, collision avoidance, and energy efficiency when $\lambda_{sw} = 0.5$ and $\lambda_{safe} = 1.0$. Excessively small $\lambda_{sw}$ leads to frequent mode oscillations, while overly large values reduce responsiveness near transition regions. Similarly, insufficient safety penalties increase collision risk, whereas excessively large $\lambda_{safe}$ values result in overly conservative behaviors and increased energy consumption.

Table 8 further demonstrates that the proposed framework is moderately sensitive to the cost-map weighting coefficient. When $\lambda_{cost}$ is too small, the agent tends to prioritize shorter paths while neglecting terrain risks, leading to higher collision rates and unstable cross-domain transitions. Conversely, excessively large $\lambda_{cost}$ values encourage overly conservative navigation behaviors, resulting in longer trajectories and reduced navigation efficiency. The best overall trade-off is achieved near $\lambda_{cost} = 0.5$, which balances terrain awareness, safety, and path efficiency.

Figure 13 shows that CTS remains relatively stable across a broad range of $\kappa$ values, indicating that CD-HSSRL is not overly sensitive to precise option termination threshold tuning.

Overall, the parameter sensitivity analysis indicates that CD-HSSRL maintains stable performance across a wide range of hyperparameter configurations, with moderate sensitivity to the cost-map weighting coefficient, which supports the robustness of the proposed framework.

5.6. Computational Cost and Scalability

The proposed CD-HSSRL framework introduces additional computational complexity compared to conventional single-policy reinforcement learning methods due to its hierarchical structure and modular components.

From an inference perspective, the framework consists of a high-level policy, multiple low-level controllers, and a safety projection module. However, these components operate at different temporal scales. The high-level policy is executed at a lower frequency to select sub-tasks or domains, while the low-level controller generates control commands at a higher frequency. As a result, the additional computational overhead during execution remains manageable for real-time applications.
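
The two temporal scales can be illustrated with a small control loop in which the high-level policy is queried only every few steps while the low-level controller and the safety step run at every step. Everything in this sketch is an assumption for illustration: the stub policies, the period of five steps, and the use of simple box clipping as a stand-in for the safety projection module.

```python
from typing import Callable, List, Tuple


def clip_projection(action: float, low: float, high: float) -> float:
    """Stand-in safety step: project a scalar action onto the box [low, high]."""
    return max(low, min(high, action))


def run_episode(num_steps: int, high_level_period: int,
                select_option: Callable[[int], str],
                low_level: Callable[[str, int], float],
                limit: float = 1.0) -> Tuple[int, List[Tuple[str, float]]]:
    """Run a two-rate loop and return (high-level calls, per-step log)."""
    option = ""
    high_level_calls = 0
    log: List[Tuple[str, float]] = []
    for t in range(num_steps):
        if t % high_level_period == 0:  # slow loop: choose the option
            option = select_option(t)
            high_level_calls += 1
        raw_action = low_level(option, t)  # fast loop: continuous command
        log.append((option, clip_projection(raw_action, -limit, limit)))
    return high_level_calls, log


# Hypothetical stub policies.
def stub_select_option(t: int) -> str:
    return "water" if t < 10 else "land"


def stub_low_level(option: str, t: int) -> float:
    return 1.5 if option == "water" else 0.5


calls, log = run_episode(20, 5, stub_select_option, stub_low_level)
print(calls, log[0], log[10])
```

In this toy run the high-level policy is evaluated 4 times over 20 control steps, which is the sense in which the hierarchy adds only a modest per-step cost.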

In terms of training cost, the framework requires training multiple policies, which increases the total training time and computational resources. Nevertheless, this design improves learning efficiency in complex cross-domain environments by decomposing the task into more manageable sub-problems, leading to more stable convergence.

Regarding scalability, the modular architecture of the framework facilitates extension to more complex or multi-domain scenarios. New domains can be incorporated by introducing additional domain-specific policies without fundamentally modifying the overall structure. However, this scalability is partly constrained by the need for domain-specific knowledge, such as cost maps or environment annotations.

Overall, the proposed framework represents a trade-off between computational cost and performance, prioritizing robustness and adaptability in heterogeneous environments.

5.7. Discussion of Findings and Limitations

This study proposed CD-HSSRL, a Cross-Domain Hierarchical Safe-Switching Reinforcement Learning framework for autonomous amphibious robot navigation. Comprehensive experiments demonstrated that CD-HSSRL consistently outperforms representative baselines in water-domain navigation, land-domain planning, and water–land transition tasks. The results indicate that the Cross-Domain Global Reachability Planner effectively unifies heterogeneous environmental cost representations, the Hierarchical Safe Switching Policy enables stable medium-transition decisions, and the Safety-Constrained Continuous Controller effectively reduces collision risks during complex shoreline interactions.

Beyond overall performance gains, comparative experiments against recent high-quality baselines provide deeper insights. Safety-filtering methods such as BarrierNet achieve strong collision avoidance performance, yet they exhibit conservative behaviors and reduced transition efficiency. Hierarchical planning approaches such as pH-DRL and structured planning–learning methods such as MP-DQL demonstrate improved long-horizon decision-making, but they still suffer from unstable medium switching under discontinuous water–land dynamics. By jointly optimizing global reachability planning, hierarchical switching, and safety-constrained control, CD-HSSRL mitigates these limitations and achieves a better balance between safety, stability, and navigation efficiency.

The experimental observations further suggest that explicitly modeling medium-switching stability is crucial for discontinuous cross-domain dynamics, where flat or purely hierarchical policies commonly suffer from oscillatory decisions near boundary regions. Moreover, integrating differentiable safety projection into continuous control not only improves collision avoidance but also enhances policy generalization under environmental uncertainties. These findings imply that hierarchical decision decomposition combined with constraint-aware control constitutes a promising paradigm for cross-domain robotic navigation beyond amphibious scenarios.

Despite the promising performance of the proposed CD-HSSRL framework, several limitations should be acknowledged. First, the current validation is conducted entirely in a simulation environment based on Gazebo and the UUV Simulator. Although the simulator incorporates hydrodynamic effects and provides a controllable and reproducible testing platform, it cannot fully capture the complexity and uncertainty of real-world amphibious environments. Factors such as unmodeled disturbances, sensor imperfections, and hardware constraints may affect real-world performance. Future work will focus on transferring the proposed framework to physical platforms and investigating sim-to-real adaptation strategies. Second, the proposed framework relies on several manually designed components, including cost maps and explicit domain labels (e.g., water, land, and transition regions). While these elements improve interpretability and control, they limit the level of autonomy and may reduce generalization to unseen environments where such prior knowledge is unavailable or inaccurate. Third, the method is primarily designed as an engineering-oriented system integration and does not provide formal theoretical guarantees regarding convergence, safety, or switching stability. Although empirical results demonstrate improved performance, a rigorous theoretical analysis would further strengthen the robustness of the framework. Fourth, the hierarchical structure and safety mechanisms introduce additional computational overhead compared to single-policy reinforcement learning approaches. This may limit real-time applicability in resource-constrained systems, particularly for high-frequency control tasks. Finally, the dataset-to-environment conversion process may introduce bias due to simplifications and assumptions made during mapping. Differences between the generated simulation scenarios and real-world environments may affect the generalization capability of the learned policies. 
Addressing these limitations constitutes an important direction for future research, including improving environment realism, reducing reliance on manual design, enhancing computational efficiency, and validating the framework in real-world deployments.

6. Conclusions

This paper investigated the problem of autonomous cross-domain navigation for amphibious robotic systems operating in heterogeneous water–land environments. The main research question addressed in this work is whether a hierarchical reinforcement learning framework with adaptive switching and safety-aware control can improve navigation stability and robustness under discontinuous dynamics. To address this problem, we proposed the CD-HSSRL framework, which integrates hierarchical decision-making, safety projection, and adaptive switching mechanisms into a unified navigation architecture.

The experimental results in the Gazebo + UUV simulation environment demonstrate that the proposed method achieves favorable performance compared with baseline approaches, achieving higher success rates and lower collision rates across water, land, and transition environments. In particular, in cross-domain scenarios, the proposed method improves the success rate by approximately 20% compared to conventional RL methods while maintaining stable performance under environmental disturbances. These results indicate that the proposed framework is effective for handling heterogeneous dynamics and complex navigation tasks.

However, the current study is limited to simulation-based validation. Future work will focus on real-world experiments and on integrating sim-to-real transfer techniques into the proposed framework.

Author Contributions

S.L.: Conceptualization, methodology, framework design, algorithm development, writing—original draft. L.W.: Data curation, benchmark construction, experiment implementation, validation, and visualization. X.L.: Supervision, project administration, writing—review and editing, and corresponding author. All authors have read and agreed to the published version of this manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study were sourced from three publicly available repositories: the WaterScenes dataset, available at https://github.com/WaterScenes/WaterScenes/tree/main (accessed on 28 March 2024); the maritime visual tracking dataset (MVTD), available at https://github.com/AhsanBaidar/MVTD (accessed on 20 May 2025); and the BARN ground navigation benchmark, available at https://www.cs.utexas.edu/~xiao/BARN/BARN_dataset.zip (accessed on 6 November 2020). The source code presented in this study is available on GitHub at https://github.com/ls142968/CD-HSSRL.git (accessed on 28 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CD-HSSRL	Cross-Domain Hierarchical Safe-Switching Reinforcement Learning;
CD-GRP	Cross-Domain Global Reachability Planner;
HSSP	Hierarchical Safe Switching Policy;
SCCC	Safety-Constrained Continuous Controller.

References

Ackleson, S.G. Robotic Surveyors for Shallow Coastal Environments. Oceanography 2021, 34, 96–97.
Bogue, R. The role of robots in environmental monitoring. Ind. Robot. Int. J. Robot. Res. Appl. 2023, 50, 369–375.
Narouz, A.S.; Ismail, A.; Atef, A.; Magdy, M.; Abdallah, M.; Atwa, M.; Shenoda, S.; Elsayed, M.; Ayman, S.; Ahmed, M.I. A Review of Features and Characteristics of Rescue Robot with AI. Adv. Sci. Technol. J. 2024, 1, 1–18.
Sahoo, S.K.; Choudhury, B.B.; Dhal, P.R. Exploring the role of robotics in maritime technology: Innovations, challenges, and future prospects. Spectr. Mech. Eng. Oper. Res. 2024, 1, 159–176.
Li, Q.; Li, H.; Shen, H.; Yu, Y.; He, H.; Feng, X.; Sun, Y.; Mao, Z.; Chen, G.; Tian, Z.; et al. An aerial–wall robotic insect that can land, climb, and take off from vertical surfaces. Research 2023, 6, 0144.
Wijayathunga, L.; Rassau, A.; Chai, D. Challenges and solutions for autonomous ground robot scene understanding and navigation in unstructured outdoor environments: A review. Appl. Sci. 2023, 13, 9877.
Amundsen, H.B.; Randeni, S.; Bingham, R.C.; Civit, C.; Filardo, B.P.; Føre, M.; Kelasidi, E.; Benjamin, M.R. Hybrid State Estimation and Mode Identification of an Amphibious Robot. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2025; pp. 12696–12702.
Shi, L.; Zhang, Z.; Li, Z.; Guo, S.; Pan, S.; Bao, P.; Duan, L. Design, implementation and control of an amphibious spherical robot. J. Bionic Eng. 2022, 19, 1736–1757.
Zhang, D.; Van, M.; McIlvanna, S.; Sun, Y.; McLoone, S. Adaptive safety-critical control with uncertainty estimation for human–robot collaboration. IEEE Trans. Autom. Sci. Eng. 2023, 21, 5983–5996.
Liang, D.; Huang, X.; Xue, Z.; Li, P. Path planning for amphibious unmanned ground vehicles under cross-domain constraints. Intell. Serv. Robot. 2025, 18, 1381–1416.
Corsi, D.; Camponogara, D.; Farinelli, A. Aquatic navigation: A challenging benchmark for deep reinforcement learning. arXiv 2024, arXiv:2405.20534.
Zhu, Y.; Wan Hasan, W.Z.; Harun Ramli, H.R.; Norsahperi, N.M.H.; Mohd Kassim, M.S.; Yao, Y. Deep reinforcement learning of mobile robot navigation in dynamic environment: A review. Sensors 2025, 25, 3394.
Zhong, G.; Lu, X.; Deng, T.; Cao, J. Multimodal amphibious robotics: Co-design of hybrid propulsion system and quaternion-based adaptive control for cross-domain transitions. Control Eng. Pract. 2026, 167, 106644.
Xia, H.; Xu, Y.; Li, Z. Hybrid actuators and their reuse methodologies for amphibious robots. Robot. Intell. Autom. 2025, 45, 465–480.
Cuevas, J.K.; Dionisio, D.A.I.; Dris, M.K.; Flores, B.F.; Romana, C.J.S.; Bautista, A.J. Retrofitting a Commercially Available Remote Controlled Boat into an Amphibious Robot for Flood Operations Rescue Surveillance (FlOReS) Assistance. In Proceedings of the 2024 9th International Conference on Control and Robotics Engineering (ICCRE); IEEE: New York, NY, USA, 2024; pp. 39–44.
Policarpo, H.; Lourenço, J.P.; Anastácio, A.M.; Parente, R.; Rego, F.; Silvestre, D.; Afonso, F.; Maia, N.M. Conceptual design of an unmanned electrical amphibious vehicle for ocean and land surveillance. World Electr. Veh. J. 2024, 15, 279.
Zhang, K.; Ye, Y.; Chen, K.; Li, Z.; Li, K. Enhanced AUV autonomy through fused energy-optimized path planning and deep reinforcement learning for integrated navigation and dynamic obstacle detection. J. Mar. Sci. Eng. 2025, 13, 1294.
Zhu, A.; Zhao, J.; Yang, L. Multimodal magnetic miniature robot for adaptive navigation in amphibious environments. npj Robot. 2025, 3, 42.
Duan, M. Attention-based multi-agent reinforcement learning for traffic flow stability in mountainous tunnel entrances. Sci. Rep. 2025, 15, 37278.
Politi, E.; Stefanidou, A.; Chronis, C.; Dimitrakopoulos, G.; Varlamis, I. Adaptive deep reinforcement learning for efficient 3D navigation of autonomous underwater vehicles. IEEE Access 2024, 12, 178209–178221.
Mackay, A.K.; Riazuelo, L.; Montano, L. RL-DOVS: Reinforcement learning for autonomous robot navigation in dynamic environments. Sensors 2022, 22, 3847.
Li, H.; Guo, S.; Li, C.; Huang, L. Study on 3D Path Planning of AUV Based on the Reinforcement Learning Method. In Proceedings of the 2025 IEEE International Conference on Mechatronics and Automation (ICMA); IEEE: New York, NY, USA, 2025; pp. 821–826.
Mou, J.; Shi, B.; Wang, B.; Yu, C.; Wang, Y.; Zhong, F.; Zheng, L.; Wang, J.; Li, J. A novel reinforcement learning framework-based path planning algorithm for unmanned surface vehicle. Front. Mar. Sci. 2025, 12, 1641093.
Tao, B.; Kim, J.H. Deep reinforcement learning-based local path planning in dynamic environments for mobile robot. J. King Saud. Univ. Comput. Inf. Sci. 2024, 36, 102254.
Eppe, M.; Gumbsch, C.; Kerzel, M.; Nguyen, P.D.; Butz, M.V.; Wermter, S. Intelligent problem-solving as integrated hierarchical reinforcement learning. Nat. Mach. Intell. 2022, 4, 11–20.
Li, H.; Luo, B.; Song, W.; Yang, C. Predictive hierarchical reinforcement learning for path-efficient mapless navigation with moving target. Neural Netw. 2023, 165, 677–688.
Schneider, T.; Pedrosa, M.V.; Gros, T.P.; Wolf, V.; Flaßkamp, K. Motion Primitives as the Action Space of Deep Q-Learning for Planning in Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17852–17864.
Xiao, W.; Wang, T.H.; Hasani, R.; Chahine, M.; Amini, A.; Li, X.; Rus, D. Barriernet: Differentiable control barrier functions for learning of safe robot control. IEEE Trans. Robot. 2023, 39, 2289–2307.
Da, L.; Turnau, J.; Kutralingam, T.P.; Velasquez, A.; Shakarian, P.; Wei, H. A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models. arXiv 2025, arXiv:2502.13187.
Ampuero, G.C.; Hermosilla, G.; Varas, G.; Clark, M.T. Deep Reinforcement Learning for Sim-to-Real Robot Navigation with a Minimal Sensor Suite for Beach-Cleaning Applications. Appl. Sci. 2025, 15, 10719.
Yao, S.; Guan, R.; Wu, Z.; Ni, Y.; Huang, Z.; Liu, R.W.; Yue, Y.; Ding, W.; Lim, E.G.; Seo, H.; et al. WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset and Benchmarks for Autonomous Driving on Water Surfaces. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16584–16598.
Bakht, A.B.; Din, M.U.; Javed, S.; Hussain, I. MVTD: A Benchmark Dataset for Maritime Visual Object Tracking. arXiv 2025, arXiv:2506.02866.
Perille, D.; Truong, A.; Xiao, X.; Stone, P. Benchmarking Metric Ground Navigation. In Proceedings of the 2020 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR); IEEE: New York, NY, USA, 2020.
Jiang, W.; Liu, J.; Wang, W.; Wang, Y. Global Path Planning for Land–Air Amphibious Biomimetic Robot Based on Improved PPO. Biomimetics 2026, 11, 25.
Xiaofei, Y.; Yilun, S.; Wei, L.; Hui, Y.; Weibo, Z.; Zhengrong, X. Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle. Ocean Eng. 2022, 266, 112809.
Yin, S.; Xiang, Z. Energy-constrained collaborative path planning for heterogeneous amphibious unmanned surface vehicles in obstacle-cluttered environments. Ocean Eng. 2025, 330, 121241.
Yin, S.; Hu, J.; Xiang, Z. Multi-objective collaborative path planning for multiple water-air unmanned vehicles in cramped environments. Expert Syst. Appl. 2025, 292, 128625.
Zhang, N.; Chen, Y.; Wu, Y.; Ji, M.; Wang, B. A hybrid APF-DQN framework with transformer-based current prediction for USV path planning in dynamic ocean environments. Sci. Rep. 2025, 16, 3575.
Hua, M.; Zhou, W.; Cheng, H.; Chen, Z. Improved DDPG algorithm-based path planning for unmanned surface vehicles. Intell. Robot. 2024, 4, 363–384.
Yang, C.; Zhao, Y.; Cai, X.; Wei, W.; Feng, X.; Zhou, K. Path planning algorithm for unmanned surface vessel based on multiobjective reinforcement learning. Comput. Intell. Neurosci. 2023, 2023, 2146314.
Fan, Y.; Sun, Z.; Wang, G. A novel reinforcement learning collision avoidance algorithm for USVs based on maneuvering characteristics and COLREGs. Sensors 2022, 22, 2099.
Hu, H.; Wang, Y.; Tong, W.; Zhao, J.; Gu, Y. Path planning for autonomous vehicles in unknown dynamic environment based on deep reinforcement learning. Appl. Sci. 2023, 13, 10056.
Liang, C.; Liu, L.; Liu, C. Multi-UAV autonomous collision avoidance based on PPO-GIC algorithm with CNN–LSTM fusion network. Neural Netw. 2023, 162, 21–33.

Figure 1. Structure and sensor layout of tracked-thruster amphibious robot.

Figure 2. Overall architecture of the proposed CD-HSSRL framework.

Figure 3. Cross-Domain Global Reachability Planner (CD-GRP).

Figure 4. Hierarchical Safe Switching Policy (HSSP).

Figure 5. Safety-Constrained Continuous Controller (SCCC).

Figure 6. Schematic diagram of the overall experimental process.

Figure 7. Comparison of success rate (SR) and collision rate (CR) between CD-HSSRL and baselines on different datasets.

Figure 8. Representative water-to-land transition trajectories. CD-HSSRL achieves stable and efficient shoreline climbing, while IPPO and HEA-PPO exhibit unstable switching behaviors, and BarrierNet shows conservative but safe transitions.

Figure 9. Diagram of robot position changes over time.

Figure 10. Temporal switching sequence comparison. CD-HSSRL produces stable water → transition → land switching without oscillations, while IPPO oscillates and BarrierNet delays switching due to conservative safety filtering.

Figure 11. Ablation study on key CD-HSSRL components. Performance degradation is observed when removing cross-domain planning, hierarchical switching, or safety projection modules.

Figure 12. Success rate under increasing obstacle density. CD-HSSRL exhibits slower performance degradation than IPPO, HEA-PPO, RLCA, and pH-DRL.

Figure 13. Sensitivity of CTS under different option termination thresholds $\kappa$.

Table 1. Quantitative comparison of navigation performance across multiple environments. The best results are highlighted in bold.

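Method	WaterScenes SR ↑	WaterScenes CR ↓	WaterScenes APL ↓	WaterScenes EC ↓	MVTD SR ↑	MVTD CR ↓	MVTD APL ↓	MVTD EC ↓	BARN SR ↑	BARN CR ↓	BARN APL ↓	BARN EC ↓	Gazebo Cross-Domain SR ↑	Gazebo Cross-Domain CR ↓	Gazebo Cross-Domain SSI ↑	Gazebo Cross-Domain CTS ↑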
IPPO	0.86	0.09	128.4	34.8	0.79	0.13	142.6	39.5	0.90	0.06	116.9	30.7	0.78	0.14	0.78	0.71
DDQN	0.83	0.11	132.7	36.2	0.76	0.15	149.8	41.1	0.88	0.07	120.8	32.1	0.74	0.17	0.75	0.66
HEA-PPO	0.88	0.08	124.9	33.7	0.81	0.12	140.9	38.6	0.91	0.05	114.2	29.6	0.80	0.13	0.80	0.74
IMTCMO	0.87	0.08	126.1	33.9	0.80	0.12	143.3	38.9	0.92	0.05	112.7	29.2	0.79	0.13	0.81	0.73
APF-DQN	0.89	0.07	123.8	32.4	0.83	0.11	137.2	37.5	0.85	0.08	126.5	34.9	0.76	0.15	0.77	0.69
I-DDPG	0.87	0.08	125.6	33.1	0.82	0.11	138.9	36.8	0.86	0.08	124.1	33.6	0.75	0.16	0.76	0.68
MORL-based	0.88	0.08	124.4	32.8	0.82	0.11	139.4	37.2	0.87	0.07	122.9	33.0	0.76	0.15	0.77	0.70
RLCA	0.86	0.06	129.6	35.4	0.80	0.09	145.2	40.1	0.84	0.06	127.8	35.9	0.77	0.10	0.79	0.72
APF-D3QNPER	0.90	0.07	121.9	33.6	0.84	0.10	135.8	39.2	0.86	0.07	124.6	34.1	0.78	0.13	0.80	0.74
CLPPO-GIC	0.89	0.07	122.7	33.0	0.85	0.10	134.9	38.4	0.88	0.06	120.6	32.5	0.81	0.12	0.83	0.76
BarrierNet	0.90	0.06	121.5	32.9	0.84	0.09	134.2	37.9	0.89	0.05	118.9	31.2	0.82	0.09	0.85	0.79
pH-DRL	0.88	0.07	124.8	33.4	0.83	0.10	137.1	38.1	0.91	0.05	114.6	29.8	0.84	0.10	0.86	0.81
MP-DQL	0.87	0.08	125.9	34.1	0.82	0.11	138.7	38.8	0.90	0.06	115.8	30.4	0.83	0.10	0.85	0.80
CD-HSSRL (Ours)	0.93	0.05	118.6	30.8	0.88	0.08	129.7	34.6	0.94	0.04	108.9	27.8	0.87	0.08	0.90	0.86

Note: ↑ indicates higher values are better; ↓ indicates lower values are better.

Table 2. Cross-domain transition performance in Gazebo amphibious simulation. The best results are highlighted in bold.

Method	CTS ↑	SSI ↑	CR ↓	SVR ↓	EC ↓
IPPO	0.71	0.78	0.14	0.18	36.2
HEA-PPO	0.74	0.80	0.13	0.16	35.1
RLCA	0.72	0.79	0.10	0.12	39.8
BarrierNet	0.79	0.83	0.09	0.08	34.6
CD-HSSRL (Ours)	0.86	0.90	0.08	0.05	31.6

Note: ↑ indicates higher values are better; ↓ indicates lower values are better.

Table 3. Ablation study results on Gazebo cross-domain environment. The best results are highlighted in bold.

Method	CTS ↑	SSI ↑	CR ↓	EC ↓
Full CD-HSSRL (Ours)	0.86	0.90	0.08	31.6
A1: w/o CD-GRP	0.78	0.84	0.12	35.9
A2: w/o HSSP	0.73	0.70	0.15	34.8
A3: w/o Safety Projection	0.69	0.72	0.22	30.9
A4: w/o Risk-Sensitive Rwd	0.76	0.82	0.14	33.7
A5: w/o Switching Reg.	0.74	0.69	0.16	32.8

Note: ↑ indicates higher values are better; ↓ indicates lower values are better.

Table 4. Robustness against hydrodynamic disturbances.

Current (m/s)	IPPO SR ↑	IPPO CR ↓	HEA-PPO SR ↑	HEA-PPO CR ↓	RLCA SR ↑	RLCA CR ↓	BarrierNet SR ↑	BarrierNet CR ↓	CD-HSSRL SR ↑	CD-HSSRL CR ↓
0.0	0.78	0.12	0.80	0.11	0.76	0.08	0.83	0.07	0.87	0.08
0.5	0.74	0.15	0.77	0.13	0.73	0.09	0.81	0.08	0.85	0.09
1.0	0.69	0.19	0.72	0.17	0.68	0.11	0.77	0.10	0.82	0.11
1.5	0.63	0.24	0.66	0.22	0.62	0.14	0.72	0.13	0.78	0.14

Note: ↑ indicates higher values are better; ↓ indicates lower values are better.
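
One way to summarize Table 4 is the relative loss in success rate between still water (0.0 m/s) and the strongest current (1.5 m/s). The sketch below computes that figure from the SR columns as transcribed from Table 4. The metric itself and the function names are assumptions made for illustration; the paper reports only the raw SR and CR values.

```python
# SR values transcribed from Table 4, ordered by current: 0.0, 0.5, 1.0, 1.5 m/s.
SR = {
    "IPPO": [0.78, 0.74, 0.69, 0.63],
    "HEA-PPO": [0.80, 0.77, 0.72, 0.66],
    "RLCA": [0.76, 0.73, 0.68, 0.62],
    "BarrierNet": [0.83, 0.81, 0.77, 0.72],
    "CD-HSSRL": [0.87, 0.85, 0.82, 0.78],
}


def relative_drop_percent(values):
    """Percentage of the first value lost by the last value, to one decimal."""
    first, last = values[0], values[-1]
    return round(100.0 * (first - last) / first, 1)


def most_robust(table):
    """Method with the smallest relative SR drop across the current sweep."""
    return min(table, key=lambda method: relative_drop_percent(table[method]))


if __name__ == "__main__":
    for method, values in SR.items():
        print(f"{method}: {relative_drop_percent(values)}% SR lost")
    print("Smallest relative drop:", most_robust(SR))
```

Under this measure CD-HSSRL loses about 10.3% of its still-water success rate at 1.5 m/s, compared with about 19.2% for IPPO.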

Table 5. Robustness against perception noise.

Noise Std.	IPPO	HEA-PPO	RLCA	BarrierNet	CD-HSSRL
0.0	0.71	0.74	0.72	0.79	0.86
0.1	0.68	0.71	0.69	0.77	0.84
0.2	0.63	0.67	0.65	0.74	0.81
0.3	0.58	0.62	0.60	0.70	0.77

Table 6. Sensitivity analysis on switching regularization coefficient $λ_{sw}$.

$λ_{sw}$	CTS ↑	SSI ↑
0.0	0.74	0.69
0.2	0.80	0.81
0.5	0.86	0.90
0.8	0.85	0.89
1.0	0.83	0.87

Note: ↑ indicates higher values are better; ↓ indicates lower values are better.

Table 7. Sensitivity analysis on safety projection penalty $λ_{safe}$.

$λ_{safe}$	CR ↓	EC ↓
0.1	0.18	29.7
0.5	0.12	30.8
1.0	0.08	31.6
1.5	0.08	33.2
2.0	0.07	35.4

Note: ↑ indicates higher values are better; ↓ indicates lower values are better.

Table 8. Sensitivity analysis on cost-map weighting coefficient $λ_{cost}$.

$λ_{cost}$	SR ↑	CR ↓	PL ↓
0.1	0.77	0.16	21.8
0.3	0.82	0.11	22.9
0.5	0.86	0.08	24.1
1.0	0.85	0.07	26.4
2.0	0.81	0.06	29.3

Note: ↑ indicates higher values are better; ↓ indicates lower values are better.
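
The trade-off in Table 8 can be checked mechanically: SR peaks at an intermediate weight, while CR and PL move monotonically in opposite directions. The following sketch, with rows transcribed from Table 8 and hypothetical helper names, confirms this pattern.

```python
# Rows transcribed from Table 8: (lambda_cost, SR, CR, PL).
TABLE_8 = [
    (0.1, 0.77, 0.16, 21.8),
    (0.3, 0.82, 0.11, 22.9),
    (0.5, 0.86, 0.08, 24.1),
    (1.0, 0.85, 0.07, 26.4),
    (2.0, 0.81, 0.06, 29.3),
]


def column(rows, index):
    """Extract one column of the table as a list."""
    return [row[index] for row in rows]


def is_strictly_monotonic(values, increasing):
    """True if every step moves strictly in the requested direction."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(b > a for a, b in pairs)
    return all(b < a for a, b in pairs)


def best_sr_setting(rows):
    """lambda_cost value of the row with the highest success rate."""
    return max(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    print("SR-maximizing lambda_cost:", best_sr_setting(TABLE_8))
    print("CR strictly decreasing:", is_strictly_monotonic(column(TABLE_8, 2), False))
    print("PL strictly increasing:", is_strictly_monotonic(column(TABLE_8, 3), True))
```

A larger $λ_{cost}$ keeps lowering the collision rate but lengthens the path, and beyond 0.5 the success rate begins to fall.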

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
