Article

Depth Matters: Geometry-Aware RGB-D-Based Transformer-Enabled Deep Reinforcement Learning for Mapless Navigation

by Alpaslan Burak İnner 1,* and Mohammed E. Chachoua 2
1 Computer Engineering Department, Kocaeli University, İzmit 41001, Kocaeli, Türkiye
2 Electronics and Communication Engineering Department, Kocaeli University, İzmit 41001, Kocaeli, Türkiye
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1242; https://doi.org/10.3390/app16031242
Submission received: 6 October 2025 / Revised: 11 December 2025 / Accepted: 23 December 2025 / Published: 26 January 2026

Abstract

Autonomous navigation in unknown environments demands policies that can jointly perceive semantic context and geometric safety. Existing Transformer-enabled deep reinforcement learning (DRL) frameworks, such as the Goal-guided Transformer Soft Actor–Critic (GoT-SAC), rely on temporal stacking of multiple RGB frames, which encodes short-term motion cues but lacks explicit spatial understanding. This study introduces a geometry-aware RGB-D early fusion modality that replaces temporal redundancy with cross-modal alignment between appearance and depth. Within the GoT-SAC framework, we integrate a pixel-aligned RGB-D input into the Transformer encoder, enabling the attention mechanism to simultaneously capture semantic textures and obstacle geometry. A comprehensive systematic ablation study was conducted across five modality variants (4RGB, RGB-D, G-D, 4G-D, and 4RGB-D) and three fusion strategies (early, parallel, and late) under identical hyperparameter settings in a controlled simulation environment. The proposed RGB-D early fusion achieved a 40.0% success rate and +94.1 average reward, surpassing the canonical 4RGB baseline (28.0% success, +35.2 reward), while a tuned configuration further improved performance to 54.0% success and +146.8 reward. These results establish early pixel-level multimodal fusion (RGB-D) as a principled and efficient successor to temporal stacking, yielding higher stability, sample efficiency, and geometry-aware decision-making. This work provides the first controlled evidence that spatially aligned multimodal fusion within Transformer-based DRL significantly enhances mapless navigation performance and offers a reproducible foundation for sim-to-real transfer in autonomous mobile robots.

1. Introduction

The classical paradigm for autonomous navigation has long relied on a well-established pipeline consisting of Simultaneous Localization and Mapping (SLAM), precise localization, and explicit path planning algorithms. This approach has demonstrated remarkable success in structured, static environments where high-precision maps can be constructed and maintained [1,2]. The maturity of this framework is evidenced by its widespread adoption in industrial robotics and controlled environments, where modern algorithms such as Hybrid A* [3], RRT* [4], and Probabilistic Roadmaps (PRM*) [5] have proven their efficacy in finding optimal paths through known obstacle configurations. Despite continuous algorithmic improvements, the classical paradigm suffers from several fundamental limitations that become increasingly problematic in real-world deployment scenarios [2,6]. These methods achieve strong performance in structured static environments but depend on accurate maps and heavy computation, which limits their adaptability in dynamic or unknown spaces.
To overcome these limitations, mapless navigation approaches have been developed. Instead of building a map, robots use onboard sensors, such as cameras, LiDAR, or depth cameras, together with reactive control or learning-based heuristics to reach a goal [7]. Although more flexible, mapless methods often rely on simplified perception modules, rendering them brittle under visual aliasing, lighting changes, or ambiguous geometries.
To address this perceptual brittleness, recent end-to-end approaches have focused on more powerful representation learning. Transformers are particularly valued for their ability to model long-range dependencies and flexibly integrate multimodal tokens from raw sensor data.
These robust representations serve as inputs for the decision-making framework. A recent wave of learning-based navigation leverages deep reinforcement learning (DRL) to replace hand-engineered pipelines with end-to-end policies that map raw perception to control actions [8,9]. The fusion of powerful Transformer encoders for perception with DRL for control has produced significant milestones. A representative example is the goal-guided Transformer Soft Actor-Critic (GoT-SAC) by Huang et al. [10]. By prepending a goal token to the visual patch tokens, GoT-SAC enforces early goal conditioning, which improves sample efficiency and sim-to-real generalization compared to convolutional baselines [10].
Nevertheless, the framework’s reliance on four stacked RGB frames (4RGB) to encode temporal dynamics leaves the role of other sensory modalities, particularly depth, under-explored. Depth sensing provides explicit geometry, including obstacle proximity, free-space boundaries, and drivable surfaces, which RGB alone struggles to infer reliably, particularly under domain shifts, such as lighting variation. These properties make depth a natural candidate for improving safety and sim-to-real robustness, as emphasized in recent safety-aware Transformer designs such as InterFuser [11] and NavFormer [12].
In this paper, we investigate a foundational design choice in Transformer-enabled Deep Reinforcement Learning (DRL) navigation: the trade-off between temporal redundancy and multimodal fusion. We posit that advancing beyond the conventional 4RGB temporal stacking baseline, pioneered by Huang et al. [10], toward a robust RGB-D early fusion framework can yield significant gains in performance and stability. This shift leverages the synergistic integration of RGB’s rich appearance data with the precise geometric insights of depth, which are critical for collision-aware navigation in unknown environments. Crucially, this approach is grounded in practice, as commercial RGB-D sensors, such as Intel RealSense and ZED cameras, natively produce the pixel-aligned data streams utilized in this study. This alignment with common hardware helps to establish a new standard for enhancing success rates, training stability, and sim-to-real transferability. To substantiate this hypothesis, we conducted a rigorous systematic ablation study, evaluating diverse input modalities including grayscale-depth (G-D), four stacked grayscale-depth (4G-D), and four stacked RGB-depth (4RGB-D) configurations, alongside a comprehensive analysis of fusion strategies (early, parallel and late). All experiments were meticulously controlled, maintaining consistent Transformer and Soft Actor-Critic (SAC) architectures to isolate the impact of modality and fusion design on navigational performance [10]. To the best of our knowledge, this is the first controlled study on modality fusion design in transformer-based DRL navigation. This approach not only addresses the limitations of the temporal redundancy inherent in 4RGB but also aligns with emerging trends in multimodal DRL, as evidenced by recent studies [11,12], which underscore the superiority of geometry-aware representations in dynamic settings. 
Real-world deployment introduces additional challenges such as lighting-induced ground reflections that distort fisheye RGB images and dynamic obstacles with ground-similar colors, which have been reported to degrade navigation performance in prior physical evaluations of Transformer-based DRL frameworks [10]. These factors motivate our controlled simulation design to isolate modality and fusion effects before hardware testing.
Our primary contributions are as follows.
  • We demonstrate that an early fusion RGB-D input modality significantly outperforms the canonical 4RGB temporal stacking baseline. We establish that this approach replaces inefficient temporal redundancy with valuable spatial complementarity, yielding a more geometry-aware representation that enhances policy success, sample efficiency, and robustness.
  • We provide the first systematic ablation study that decouples input modality from fusion methodology in this architecture. By isolating the distinct contributions of color, depth, temporal stacking, and providing direct comparison of fusion strategies (early, parallel, and late), our study offers clear and evidence-based principles directly informing model design methodology.

2. Related Work

Mapless navigation research is actively addressing several critical challenges, such as bridging the sim-to-real gap where policies fail on physical hardware, achieving robust generalization to unseen environments, and safely handling dynamic obstacles. While much research focuses on novel algorithms to solve these complex problems, a foundational design choice has been largely overlooked regarding the most effective and efficient sensory input to feed these models. This study addresses this gap by analyzing the trade-off between temporal redundancy (4RGB) and multimodal fusion (RGB-D).
Depth is a well-established modality in robotics and is essential for geometry-aware navigation. Classical planners leverage depth or LiDAR to construct occupancy grids and local cost maps [1]. In learning-based navigation, depth has been shown to improve robustness under domain shifts, such as lighting variation, by providing explicit obstacle proximity and traversability cues that RGB alone cannot reliably infer [13,14,15]. Recent studies continue to affirm this, leveraging RGB-D sensors for complex DRL-based tasks such as visual navigation in crowded environments [16] and improving localization-awareness during mapless navigation [17]. These findings highlight the critical role of depth in enhancing the safety and sim-to-real generalization of modern learning-based navigation policies [18].
Building on geometry-aware inputs, such as depth, a multimodal DRL system must address the critical design choice of where and how to fuse these heterogeneous data streams. Three canonical strategies are typically considered: early, parallel, and late fusion [19]. Early fusion concatenates modalities at the pixel/patch level, maximizing cross-modal interactions but risking redundancy [11]. Parallel fusion maintains separate streams for each modality before combining them in the transformer; this approach imposes a significant computational burden by doubling the token length and computational cost [11]. Late fusion integrates modalities only after independent encoding, reducing the token count but risking information loss [20,21]. Each of these approaches has been used in perception–action pipelines, particularly in camera–LiDAR fusion for autonomous driving [11,12].
Recent studies have increasingly turned to the Transformer architecture, whose inherent ability to model long-range dependencies makes it a powerful tool for fusing multimodal tokens generated from sensory inputs. A foundational concept in this domain is goal-conditioned learning, in which the policy is explicitly guided by the target. Earlier approaches, such as the work by Zhu et al. [8], typically implemented this by concatenating goal information with latent features after the primary perception encoding. However, this method decouples the goal from the scene representation process, which can result in the learning of goal-irrelevant features and thus poor data efficiency [8]. To address this limitation, Huang et al. introduced the Goal-guided Transformer (GoT), which innovatively treats the goal as a direct input to the scene encoder via a dedicated goal token [10]. This architectural shift forces the model’s self-attention mechanism to couple the visual representation with the goal from the earliest stage, motivating the policy to concentrate on goal-relevant features and thereby enhancing data efficiency. Beyond GoT-SAC, other transformer-DRL paradigms illustrate the diversity of designs: Decision Transformer (DT), which conditions on returns-to-go rather than explicit goals [22]; PoliFormer, which uses a causal transformer decoder with distributed Proximal Policy Optimization (PPO) to achieve large-scale zero-shot transfer [23]; and InterFuser, which employs cross-modal attention to fuse the camera and LiDAR for safety-critical driving [11]. This trend continues with recent works exploring hierarchical DRL with spatial memory [24], vision-based DRL for autonomous drones [25], and the use of Decision Transformers for navigation tasks [26]. Concurrently, other works are leveraging large pre-trained models for language-conditioned manipulation, further expanding the capability of autonomous agents [27]. 
While these frameworks demonstrate the power of transformers, a direct numerical comparison with our work is not feasible, as they address different tasks, sensors, and algorithms. For example, InterFuser is designed for autonomous driving using camera-LiDAR fusion, a different domain and sensor suite. Similarly, NavFormer utilizes a different DRL algorithm (PPO) in a different simulation setup. Therefore, our study’s contribution is not to benchmark against these specific architectures, but rather to conduct the first controlled, systematic ablation on input modality (4RGB vs. RGB-D) and fusion strategy (early, parallel, late) within a unified framework (GoT-SAC), a foundational question not addressed by prior works.
Mapless navigation has emerged as a significant strategy in robotics, focusing on developing policies that directly utilize raw sensor inputs, such as visual images, to operate in unknown environments [8,28]. Early DRL approaches relied on convolutional encoders coupled with actor–critic methods such as ConvNet-SAC [29,30], where stacked image frames were used as inputs to learn collision-free navigation policies. Extensions have incorporated attention mechanisms, such as ViT-SAC [31,32], and imitation learning frameworks, such as MultiModal CIL [33,34], showing that learning-based agents can generalize to unseen layouts. These studies established the feasibility of DRL for navigation but also revealed high sensitivity to input modality choices, motivating a systematic analysis.

3. Methodology

3.1. GoT-SAC Framework

In this study, we adopt the Goal-guided Transformer Soft Actor–Critic (GoT-SAC) framework proposed by Huang et al. [10] as our reference baseline. This framework integrates transformer-based tokenization with the Soft Actor–Critic (SAC) algorithm [29], offering a robust and data-efficient foundation for autonomous navigation in unknown environments [10]. In this formulation, each state is represented by four consecutive RGB frames captured by a fisheye camera. The frames are stacked along the channel dimension, resulting in an input tensor o_t^{4RGB} ∈ ℝ^{128×160×12}, where the twelve channels correspond to four frames times three color channels. This temporal stacking encodes short-term motion cues, that is, changes in pixel intensity that correlate with movement, but provides no explicit geometric information. Furthermore, this reliance on temporal stacking introduces two key inefficiencies, which motivate our investigation. First, in many navigation scenarios, consecutive stacked frames are highly temporally correlated and thus redundant, offering little new information while increasing the input dimensionality (12 channels). Second, it forces the model to solve a more complex learning problem: it must implicitly infer critical geometric information (such as obstacle proximity) from subtle, ambiguous pixel changes between frames, rather than being provided with an explicit geometric signal. This indirect approach can decrease sample efficiency and stability, as the agent must learn to interpret optic-flow cues instead of directly reading depth data.
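The dimensional contrast between the two representations can be sketched in a few lines of NumPy. This is an illustration with our own helper names, not the authors' code:

```python
import numpy as np

H, W = 128, 160  # input resolution used throughout the paper

def stack_4rgb(frames):
    """4RGB baseline: stack four consecutive RGB frames along the channel axis."""
    assert len(frames) == 4 and all(f.shape == (H, W, 3) for f in frames)
    return np.concatenate(frames, axis=-1)                   # (128, 160, 12)

def fuse_rgbd(rgb, depth):
    """Proposed modality: early-fuse one RGB frame with its aligned depth map."""
    assert rgb.shape == (H, W, 3) and depth.shape == (H, W)
    return np.concatenate([rgb, depth[..., None]], axis=-1)  # (128, 160, 4)

frames = [np.random.rand(H, W, 3).astype(np.float32) for _ in range(4)]
depth = np.random.rand(H, W).astype(np.float32)
```

Both helpers keep the spatial grid intact; only the channel count differs, which is exactly the property exploited by the fixed patch-tokenization pipeline described next.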
The navigation point goal is expressed in polar coordinates relative to the robot, g_t = (r_t, θ_t). This goal vector is linearly projected into the transformer embedding space by a Multi-Layer Perceptron (MLP) and serves as a dedicated goal token, which is prepended to the sequence of visual patches. Each image observation is divided into non-overlapping patches of size 16 × 20. With twelve channels (four stacked frames × three RGB channels), each patch is flattened into a 3840-dimensional vector and then projected onto a 256-dimensional embedding. For an image of size 128 × 160, the patch tokenization yields 8 × 8 = 64 tokens, and the addition of the goal token produces a transformer input sequence of length 65. This token sequence is processed by a lightweight transformer encoder with two blocks and four attention heads, an architecture kept consistent with the original GoT-SAC framework [10] to isolate the impact of the input modality. Learnable position embeddings are added to the input tokens to retain spatial information, following standard ViT practice [10]. Through self-attention, the prepended goal token dynamically interacts with the visual tokens, allowing the representation to focus on goal-relevant features. The encoded representation is then passed through the SAC policy and value networks. The actor employs a Gaussian policy head that outputs continuous control commands, specifically linear velocity (v) and angular velocity (ω). The critic uses a double-Q network, implemented as an MLP, to mitigate overestimation bias. To form the critic's input, the final feature representation of the state is first obtained by selecting the output embedding corresponding to the transformer's prepended goal token. This state vector is then concatenated with the action vector (v, ω). The resulting combined vector is processed through the MLP heads to estimate the two Q-values, Q_1(s, a) and Q_2(s, a).
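The patch arithmetic above (8 × 8 = 64 non-overlapping patches of 16 × 20 pixels) can be verified with a short sketch; `patchify` is our own illustrative helper, not the framework's code:

```python
import numpy as np

def patchify(x, ph=16, pw=20):
    """Split an (H, W, C) tensor into non-overlapping flattened patches."""
    h, w, c = x.shape
    assert h % ph == 0 and w % pw == 0
    patches = (x.reshape(h // ph, ph, w // pw, pw, c)  # split rows and columns
                 .transpose(0, 2, 1, 3, 4)             # group patch axes together
                 .reshape(-1, ph * pw * c))            # flatten each patch
    return patches

obs = np.zeros((128, 160, 12), dtype=np.float32)       # 4RGB baseline input
tokens = patchify(obs)                                 # (64, 3840)
# After a learned linear projection to 256 dimensions and prepending the
# goal token, the transformer input sequence has length 64 + 1 = 65.
```

The same helper applied to a four-channel RGB-D tensor yields 64 patches of dimension 1280, matching Equation (7) below.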
Training follows the entropy-regularized SAC objective, as shown in Equation (1):
J(\pi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ Q^{\pi}(s,a) - \alpha \log \pi(a \mid s) \right] \qquad (1)
In this objective, Q^π(s, a) represents the soft action-value function learned by the critic, while the entropy term α log π(a|s) encourages policy exploration, with its weight balanced by the temperature coefficient α, which is tuned automatically during learning. To guide this process, the original framework utilizes a carefully designed reward function that combines continuous and sparse signals to enhance convergence efficiency. As shown in Equation (2), the total reward r_t is a sum of four distinct components:
r_t = r_h + r_a + r_g + r_c \qquad (2)
The first term, r_h, is a heuristic reward that provides a dense signal encouraging progress toward the goal. It is calculated as the reduction in Euclidean distance to the goal between consecutive timesteps, scaled by a constant weight η_h, as defined in Equation (3):
r_h = \eta_h \left( \left\| p_{t-1}^{(x,y)} - q^{(x,y)} \right\|_2 - \left\| p_t^{(x,y)} - q^{(x,y)} \right\|_2 \right) \qquad (3)
The second term, r_a, is an action reward designed to promote efficient trajectories by rewarding forward linear velocity v_t, while penalizing the absolute angular velocity ω_t (scaled by a separate weight coefficient η_a), to discourage excessive steering, as shown in Equation (4):
r_a = v_t - \eta_a \left| \omega_t \right| \qquad (4)
Finally, two sparse terminal rewards, r_g and r_c, provide definitive signals for task completion. A large positive reward is given for reaching the goal within a tolerance radius ξ, while a large negative penalty is applied for collisions, as specified in Equations (5) and (6):
r_g = \begin{cases} +100, & \text{if } d_t \le \xi \\ 0, & \text{otherwise} \end{cases} \qquad (5)
r_c = \begin{cases} -100, & \text{if collision} \\ 0, & \text{otherwise} \end{cases} \qquad (6)
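For concreteness, the four reward components of Equations (2)-(6) can be combined in a few lines. The weight values η_h and η_a and the tolerance ξ below are illustrative placeholders, not the paper's tuned constants:

```python
import numpy as np

def step_reward(p_prev, p_cur, goal, v, omega, collided,
                eta_h=1.0, eta_a=0.1, xi=0.3):
    """Total reward r_t = r_h + r_a + r_g + r_c (Eqs. 2-6), with assumed weights."""
    d_prev = np.linalg.norm(np.asarray(p_prev) - np.asarray(goal))
    d_cur = np.linalg.norm(np.asarray(p_cur) - np.asarray(goal))
    r_h = eta_h * (d_prev - d_cur)        # dense progress reward, Eq. (3)
    r_a = v - eta_a * abs(omega)          # action-shaping reward, Eq. (4)
    r_g = 100.0 if d_cur <= xi else 0.0   # sparse goal reward, Eq. (5)
    r_c = -100.0 if collided else 0.0     # sparse collision penalty, Eq. (6)
    return r_h + r_a + r_g + r_c          # total reward, Eq. (2)

# Example: the robot moves 0.5 m closer to the goal at v = 0.4 m/s, ω = 0.2 rad/s
r = step_reward((2.0, 0.0), (1.5, 0.0), (0.0, 0.0), v=0.4, omega=0.2, collided=False)
```

With these placeholder weights, the example step yields 0.5 (progress) + 0.38 (action shaping) = 0.88; the sparse terms fire only on goal arrival or collision.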
This baseline provides a fair and reproducible reference point. It has demonstrated strong performance in prior work [10], benefits from the data efficiency of SAC’s off-policy nature, and incorporates early goal conditioning through a prepended goal token. While these properties provide a rigorous foundation, the baseline’s effectiveness is limited by the interplay between its architecture and input data. The Vision Transformer architecture must learn spatial relationships from raw patches, making it highly dependent on the quality of its input tokens. The baseline’s reliance on a 4RGB modality exacerbates this dependency; it forces the model to infer critical geometric information, such as obstacle proximity, from subtle and often ambiguous temporal changes in pixel color. This indirect approach to geometry motivates our investigation into a sensory modality that provides explicit spatial information from the start.

3.2. Proposed RGB-D Modality

Although Huang's 4RGB baseline encodes short-term dynamics through temporal stacking, it lacks explicit geometric awareness, a critical component for safe and robust navigation in real-world robotics [10]. To address this limitation, our proposed method utilizes a single-frame, four-channel RGB-D input that naturally integrates appearance and geometry in a spatially aligned manner. RGB provides texture, contrast, and semantic cues useful for distinguishing walls, corridors, and objects, whereas the aligned depth channel provides metric information regarding obstacle proximity, free-space boundaries, and drivable surfaces [20,21]. Figure 1 visualizes the proposed RGB-D pipeline, contrasting the RGB image with its corresponding depth channel, which together form the visual state input. Depth is also invariant to lighting conditions, a common failure mode of RGB-only perception, and has been shown to improve robustness under such domain shifts [8,35].
The depth values are linearly normalized to the range [0, 1] for numerical stability. Invalid or missing values, which can occur owing to occlusion, are set to zero. The resulting input is a four-channel tensor, o_t^{RGB-D} ∈ ℝ^{128×160×4}, where the channels correspond to the three RGB components and the depth map.
The tokenization process is identical to that of the 4RGB baseline. Images are patchified into non-overlapping blocks of size 16 × 20. Each patch is then flattened into a vector whose dimension is determined by d_patch = P_H × P_W × C, where P_H is the patch height, P_W the patch width, and C the number of input channels. For the proposed four-channel RGB-D input, this yields a 1280-dimensional vector, as calculated in Equation (7):
d_{\text{patch}} = P_H \times P_W \times C = 16 \times 20 \times 4 = 1280 \qquad (7)
These vectors were projected into the same 256-dimensional embedding space, producing 64 patch tokens. Prepending the linearly projected goal token again yields a sequence of length 65, which remains unchanged from the baseline. Since the Transformer encoder’s architecture (two blocks, four heads, 256 hidden dimensions) is held constant for both the 4RGB and RGB-D configurations, any observed differences in performance can be directly attributed to the change in input modality. The policy and critic networks, which operate on the latent features from the Transformer encoder, remain unchanged. The only architectural modification is in the initial patch projection layer of the encoder. This layer is adjusted to accept the different input patch dimensions while still projecting to the same 256-dimensional embedding space. This ensures all other network parameters and the transformer sequence length remain constant.
Treating the depth as an additional channel is a design choice motivated by several key factors. First, it enforces strict architectural fairness for comparison, as the embedding size, sequence length, and number of parameters remain constant between the 4RGB and RGB-D models. This approach is intended to prevent confounds from additional modules or parameters and faithfully reflects the data format of real RGB-D sensors. To preserve the crucial spatial arrangement of the visual data, learnable position embeddings are added to the input tokens before they are processed by the Transformer encoder. The central hypothesis behind this design is that channel-level integration will enable the self-attention mechanism to jointly exploit correlations between appearance and geometry from the first layer. Consequently, we posit that the prepended goal token can attend simultaneously to textures and distances, thereby learning goal-aware representations that are both semantically and geometrically grounded.
To ensure a fair comparison, the learning protocol, including all hyperparameters and reward structures, was held constant across all experiments, as detailed in Section 4.2. The complete training loop for our proposed RGB-D early-fusion method is detailed in Algorithm 1.
Algorithm 1: RGBD Early-Fusion Goal-Guided Transformer Reinforcement Learning
1: Initialize GoT parameters φ*
2: Initialize actor and critic parameters ϕ, θ
3: Initialize entropy temperature α
4: Initialize batch size N and replay buffer D
5: Set target parameters θ_targ ← θ
6: for episode = 1 to E do
7:   Reset environment; obtain initial goal s_goal,1 and robot state s_1
8:   for t = 1 to T do
9:     Sense: capture RGB image I_t (H × W × 3) and depth map D_t (H × W)
10:    Early fusion: X_t ← stack(norm(I_t), norm(clip(D_t))) (4 channels)
11:    Goal embedding: g_t ← MLP(s_goal,t; η)
12:    Patchify & embed: {p_t^(i)}_{i=1..M} ← Patchify(X_t; P); z_t^(i) ← Embed(p_t^(i))
13:    Scene representation (GoT): h_t ← GoT([g_t; z_t^(1), …, z_t^(M)]; φ*)
14:    Act: sample a_t ~ π_ϕ(· | h_t); step environment → (r_t, s_{t+1}, s_goal,t+1)
15:    Next fused frame: get (I_{t+1}, D_{t+1}) and form X_{t+1} as above
16:    Store: D ← D ∪ {(X_t, s_goal,t, a_t, r_t, X_{t+1}, s_goal,t+1)}
17:    if time to update critic then
18:      Sample {(X_i, s_goal,i, a_i, r_i, X′_i, s′_goal,i)}_{i=1..N} ~ D
19:      h_i ← GoT([MLP(s_goal,i); Embed(Patchify(X_i))])
20:      h′_i from (X′_i, s′_goal,i) analogously
21:      Compute critic loss L_Q(θ); update θ ← θ − λ_Q ∇_θ L_Q
22:    end if
23:    if time to update actor then
24:      Compute actor loss L_π(ϕ) with entropy α; update ϕ ← ϕ − λ_π ∇_ϕ L_π
25:      if automatic entropy tuning then
26:        Update α
27:      end if
28:    end if
29:    if time to update target then
30:      θ_targ ← τθ + (1 − τ)θ_targ
31:    end if
32:  end for
33: end for

3.3. Supporting Ablations

To confirm that the superiority of RGB-D over 4RGB is not incidental, we designed three supporting ablations that systematically isolate the contributions of color, geometry, and temporal redundancy. To ensure a fair and controlled comparison, all architectural parameters (patch size, transformer blocks, attention heads) and SAC hyperparameters were held constant across all experiments, so that any observed performance differences can be attributed solely to the modality or fusion strategy under investigation.
The first ablation, gray and depth channels (G-D), evaluates whether geometry alone is sufficient without color semantics. In this setting, a grayscale channel is computed from RGB via luminance projection and concatenated with the depth map, producing an input tensor o_t^{G-D} ∈ ℝ^{128×160×2}. Patch tokenization with patches of size 16 × 20 yields a flattened vector of dimension 16 × 20 × 2 = 640 for each patch. After linear projection, 64 tokens are generated and the goal token is prepended to produce a sequence length of 65. If geometry alone were adequate for navigation, G-D would approximate RGB-D performance. Instead, as we show in Section 4, G-D significantly underperforms, confirming that geometry and appearance are complementary.
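The G-D input construction can be sketched as follows. The paper does not specify the luminance weights, so the ITU-R BT.601 coefficients below are our assumption, as are the helper names:

```python
import numpy as np

def make_gd(rgb, depth):
    """Build the two-channel G-D observation: luminance projection + depth."""
    # ITU-R BT.601 luminance weights (assumed; the paper only says "luminance projection")
    gray = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)  # (H, W)
    return np.stack([gray, depth], axis=-1)                          # (H, W, 2)

rgb = np.random.rand(128, 160, 3).astype(np.float32)
depth = np.random.rand(128, 160).astype(np.float32)
gd = make_gd(rgb, depth)
```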
The second ablation, four stacked gray and depth channels (4G-D), tests whether temporal redundancy can compensate for the absence of color. Four consecutive G-D frames are stacked to yield an input tensor o_t^{4G-D} ∈ ℝ^{128×160×8}. Patch tokenization produces flattened vectors of dimension 16 × 20 × 8 = 2560. As in the other settings, 64 patches and one goal token yield a sequence of length 65. The rationale for this experiment is that temporal stacking provides motion cues that may help disambiguate geometry. This ablation is therefore designed to test the hypothesis that such motion cues can serve as a sufficient substitute for the rich semantic information provided by color.
The third ablation, four stacked RGB and depth (4RGB-D), investigates whether adding temporal redundancy to RGB-D yields further gains or simply introduces redundancy and instability. Four consecutive RGB-D frames are stacked to produce a 16-channel input tensor, o_t^{4RGB-D} ∈ ℝ^{128×160×16}. Patch tokenization then produces vectors of dimension 16 × 20 × 16 = 5120. This setup evaluates the trade-off between the potential benefits of including temporal dynamics (appearance, geometry, and motion) and the significant computational challenge posed by the dramatically increased input dimensionality.
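The per-patch dimensions quoted across these ablations all follow the same formula, d_patch = P_H × P_W × C, which a two-line check makes explicit (our own sketch, not part of the framework):

```python
# Flattened per-patch dimension for every modality variant in the ablation,
# following d_patch = P_H * P_W * C with 16x20 patches.
PH, PW = 16, 20
channels = {"4RGB": 12, "RGB-D": 4, "G-D": 2, "4G-D": 8, "4RGB-D": 16}
d_patch = {name: PH * PW * c for name, c in channels.items()}
# -> 4RGB: 3840, RGB-D: 1280, G-D: 640, 4G-D: 2560, 4RGB-D: 5120
```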

3.4. Fusion Strategy Variants

Beyond the choice of modality, an equally important design axis is the fusion of different sensory channels. Even when using the same RGB-D observations, the stage at which appearance and geometry are integrated can significantly influence the representations learned by the Transformer. To examine this, we implemented and evaluated three canonical fusion strategies: early, parallel, and late integration.
In the early fusion strategy, RGB and depth were concatenated at the pixel level before the patch tokenization, forming a four-channel tensor. Each 16 × 20 patch was flattened into a vector of dimension 16 × 20 × 4 = 1280 , which was then projected into the transformer embedding space. The resulting 64 visual tokens, along with the prepended goal token, yielded a sequence length of 65. This approach allows for the immediate establishment of cross-modal interactions, enabling self-attention layers to jointly align texture and geometry. It also matches the design of Huang’s 4RGB baseline [10], where temporal stacking was used instead of depth, thus providing a fair point of comparison.
The parallel encoding strategy maintains modality separation until after patch tokenization. RGB patches, each of dimension 16 × 20 × 3 = 960, and depth patches, each of dimension 16 × 20 × 1 = 320, are linearly projected through separate embedding layers, augmented with modality-specific type embeddings, and concatenated into a single sequence. This doubles the token length, yielding 64 RGB tokens, 64 depth tokens, and one goal token, for a total of 129 tokens. The transformer must therefore attend across twice as many visual tokens, which increases the computational cost but also allows modality-specific specialization.
The late fusion strategy integrates the geometry at the patch level in a more compressed form. RGB patches are embedded as usual, whereas the corresponding depth information is spatially pooled into a scalar statistic per patch. This scalar was concatenated with the RGB vector before projection, producing fused patch vectors of dimension ( 16 × 20 × 3 ) + 1 = 961 . These were projected into 256-dimensional embeddings, producing 64 fused tokens plus the prepended goal token, for a sequence length of 65. In this design, depth influences the representation but does not exist as a standalone token stream. The rationale is that compressing depth into lightweight descriptors may increase efficiency and robustness to noise, although potentially at the cost of discarding fine-grained geometric details. The distinct implementations of these three fusion variants are outlined in Algorithm 2.
Algorithm 2: Fusion Strategies A–C for RGB-D Navigation
Input: RGB ∈ R^(H×W×3), Depth ∈ R^(H×W×1), Goal ∈ R^g
Data: Patch size P; Transformer encoder with L = 2 layers, H = 4 heads, d = 256
Result: Continuous control (v, ω) via SAC actor; double-Q critic values
Shared setup:
  g ← Project(Goal, 256)
Variant A — Early Fusion:
  X ← ConcatChannels(RGB, Depth) ∈ R^(H×W×4)
  {p_i}_{i=1}^{64} ← Patchify(X, P); Tok ← {Project(p_i, 256)}
  Seq ← [g] ⊕ Tok
  Z ← Transformer(Seq)
  (v, ω) ← SACActor(Z[0]); Critic ← SACCritic(Z[0])
Variant B — Parallel Fusion:
  {r_i}, {d_i} ← Patchify(RGB, P), Patchify(Depth, P)
  Tok_RGB ← {TypeEmb(Project(r_i, 256), RGB)}
  Tok_DEP ← {TypeEmb(Project(d_i, 256), DEPTH)}
  Seq ← [g] ⊕ Tok_RGB ⊕ Tok_DEP
  Z ← Transformer(Seq)
  (v, ω) ← SACActor(Z[0]); Critic ← SACCritic(Z[0])
Variant C — Late Fusion:
  {r_i}, {d_i} ← Patchify(RGB, P), Patchify(Depth, P)
  for i = 1 to 64 do
    s_i ← PoolDepthPatch(d_i); f_i ← (r_i ⊕ s_i)
    Tok_i ← Project(f_i, 256)
  end for
  Seq ← [g] ⊕ {Tok_i}
  Z ← Transformer(Seq)
  (v, ω) ← SACActor(Z[0]); Critic ← SACCritic(Z[0])
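To make the token shapes concrete, the following NumPy sketch reproduces the patch dimensionalities of the three variants for a single 128 × 160 observation. The helper name patchify and the use of NumPy are illustrative; the actual implementation uses the framework's own tokenizer and learned projections.

```python
import numpy as np

def patchify(x, ph=16, pw=20):
    """Split an H x W x C image into non-overlapping ph x pw patches, flattened."""
    H, W, C = x.shape
    x = x.reshape(H // ph, ph, W // pw, pw, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, ph * pw * C)  # (num_patches, ph * pw * C)

rgb = np.random.rand(128, 160, 3).astype(np.float32)
depth = np.random.rand(128, 160, 1).astype(np.float32)

# Variant A - early fusion: concatenate channels before tokenization
early = patchify(np.concatenate([rgb, depth], axis=-1))  # (64, 1280)

# Variant B - parallel: separate token streams per modality
tok_rgb, tok_dep = patchify(rgb), patchify(depth)        # (64, 960), (64, 320)

# Variant C - late: pool each depth patch to a scalar, append to the RGB patch
pooled = patchify(depth).mean(axis=1, keepdims=True)     # (64, 1)
late = np.concatenate([patchify(rgb), pooled], axis=1)   # (64, 961)
```

The resulting per-patch dimensions (1280, 960/320, and 961) match the embedding-layer input sizes stated in the text; each set of 64 tokens is then linearly projected to the shared 256-dimensional embedding space.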

3.5. Implementation Details

Sensor observations are pre-processed into a consistent format for the policy network. All visual inputs are resized to a fixed resolution of 128 × 160 pixels to balance informational content and computational load. Resizing is applied to both streams to place them on the same 128 × 160 grid; standard image resampling can introduce negligible sub-pixel discrepancies that do not affect our alignment assumption. The 8-bit RGB channels are linearly normalized to the range [0, 1]. The raw depth stream is clipped to a maximum effective range of 5.0 m, a practical choice tailored to our indoor layouts that reflects the operational capabilities of common commercial sensors such as the Intel RealSense D435i (approximately 3 m range) and D455 (approximately 6 m range). Clipping prioritizes critical near- and mid-range obstacles while mitigating far-field sensor noise that can degrade learning performance. After clipping, depth values are likewise normalized to [0, 1], with any invalid pixels set to zero. Both steps are deliberate design choices. Normalizing the clipped depth values to [0, 1] is standard practice in multimodal learning to ensure numerical stability: it prevents the depth channel (originally 0–5.0 m) from numerically overpowering the RGB channels (also normalized to [0, 1]) and ensures that all input features share a comparable scale. Setting invalid sensor readings (e.g., from occlusions) to zero is a common convention. Although this creates a potential ambiguity (0.0 could mean 0 m or an invalid reading), the Transformer's patch-based, contextual self-attention is well suited to disambiguate it, and we observed no evidence of this choice harming policy learning. The three-channel RGB tensor and the single-channel depth tensor are then concatenated along the channel axis to form the final four-channel observation tensor o_t^RGBD ∈ R^(128 × 160 × 4).
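The pre-processing pipeline described above can be sketched as follows. NumPy is used for illustration and the function name is ours; the clipping range and invalid-pixel convention follow the text.

```python
import numpy as np

MAX_DEPTH = 5.0  # metres; clipping range chosen for the indoor layouts

def preprocess(rgb_u8, depth_m):
    """Build the four-channel observation described above (shape H x W x 4)."""
    rgb = rgb_u8.astype(np.float32) / 255.0     # 8-bit RGB -> [0, 1]
    d = depth_m.astype(np.float32)
    invalid = ~np.isfinite(d) | (d <= 0.0)      # occlusions / sensor dropouts
    d = np.clip(d, 0.0, MAX_DEPTH) / MAX_DEPTH  # clip, then normalize to [0, 1]
    d[invalid] = 0.0                            # convention: invalid -> 0
    return np.concatenate([rgb, d[..., None]], axis=-1)
```

Feeding a 128 × 160 RGB image and a matching metric depth map yields the (128, 160, 4) observation tensor consumed by the patch tokenizer.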
In our simulation, this direct channel-wise concatenation is sufficient for creating a spatially aligned RGB-D input because both modalities are rendered by a single pinhole RGB-D virtual camera that shares identical intrinsic and extrinsic parameters. This setup yields pixel-level correspondence by construction (modulo image resampling), eliminating the need for geometric re-projection in the simulation. This approach is conceptually analogous to commercial RGB-D sensors, such as the Intel RealSense family, which provide depth-to-color-aligned data streams through their SDKs and dedicated ROS topics (e.g., /camera/aligned_depth_to_color/image_raw). On such hardware, pixel-wise correspondence is achieved through a calibration-based geometric procedure. The process, defined by Equations (8)–(10), begins by un-projecting a depth pixel (u_d, v_d) with a measured range z into a 3D point X_d in the depth camera frame using its intrinsic matrix K_d:
X_d = z K_d^{-1} [u_d, v_d, 1]^T
This 3D point is then transformed into the color camera coordinate frame as X_c, using the extrinsic rotation matrix R ∈ SO(3) and translation vector t ∈ R^3:
X_c = R X_d + t
Finally, X_c is projected onto the color image plane to find the corresponding pixel coordinates (u_c, v_c) using the color camera intrinsic matrix K_c:
λ [u_c, v_c, 1]^T = K_c X_c,   λ > 0
where λ is a scaling factor. In this standard pinhole projection model, λ equals the depth (the z-coordinate) of the point X_c. The pixel coordinates (u_c, v_c) are thus obtained by computing K_c X_c and dividing the first two components of the resulting vector by its third component, λ (the depth). Accordingly, by subscribing to the aligned depth-to-color topic on physical sensors and applying the same resize and normalization, the resulting RGB-D tensor is identical in format to the one used in our simulation, ensuring that our policy is directly compatible with real-world deployment at the input-format level; minor residuals from resampling or parallax may persist in practice.
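Equations (8)–(10) compose into a single mapping; the following NumPy sketch (with an illustrative function name) maps one depth pixel to color-image coordinates under known calibration:

```python
import numpy as np

def align_depth_pixel(u_d, v_d, z, K_d, K_c, R, t):
    """Map a depth pixel (u_d, v_d) with range z to color-image coordinates."""
    X_d = z * np.linalg.inv(K_d) @ np.array([u_d, v_d, 1.0])  # Eq. (8): un-project
    X_c = R @ X_d + t                                         # Eq. (9): extrinsics
    p = K_c @ X_c                                             # Eq. (10): project
    return p[0] / p[2], p[1] / p[2]  # divide by lambda, the depth of X_c
```

With identical intrinsics and identity extrinsics (the simulated single-camera case), the mapping reduces to the identity, which is exactly why channel-wise concatenation suffices in simulation.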
To further bridge the sim-to-real gap and enhance policy robustness, we deliberately perturb the simulated inputs. Gaussian noise (std. dev. σ = 0.05 ) is added to the RGB stream to prevent the agent from overfitting to idealized data and to expose it to imperfections characteristic of real sensors. To ensure a fair comparison across all experiments, architectural parameters—including patch size ( 16 × 20 ) , the number of visual tokens (64), and transformer dimensions—were held constant. The only modification between modalities was to the input channel count of the initial network layer to ensure dimensional compatibility. Training followed the SAC algorithm, utilizing a replay buffer of 20,000 transitions. The policy was guided by a reward function structured as a combination of a dense heuristic reward for goal progress, an action reward to encourage smooth trajectories, and large sparse terminal rewards: a positive reward of +100 for reaching the goal and a negative penalty of −100 for collisions. The key hyperparameters used for all experiments are detailed in Table 1.
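The reward structure described above can be sketched as follows; the terminal values (+100 goal, −100 collision) are from the text, while the dense-term coefficients c_prog and c_smooth are illustrative placeholders, not the paper's tuned values.

```python
def step_reward(goal_dist_prev, goal_dist, omega, reached, collided,
                c_prog=1.0, c_smooth=0.1):
    """Illustrative shaping of the reward described in the text.

    Terminal values (+100 / -100) follow the paper; c_prog and c_smooth
    are placeholder coefficients for the dense and action terms.
    """
    if reached:
        return 100.0
    if collided:
        return -100.0
    progress = c_prog * (goal_dist_prev - goal_dist)  # dense goal-progress term
    smoothness = -c_smooth * abs(omega)               # discourage sharp turning
    return progress + smoothness
```

The dense term rewards any step that reduces the polar distance to the goal, while the action term penalizes large angular velocities to encourage smooth trajectories.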

3.6. Evaluation Protocol

All agents were evaluated using a unified protocol designed to balance statistical rigor and computational feasibility. Training was paused every 50 episodes, at which point the current policy was evaluated on 50 rollouts sampled from unseen start–goal configurations, a number chosen to provide stable estimates of policy quality. To ensure a reasonable measure of reproducibility without excessive computation, each experiment was repeated three times with independent random initializations, and results are reported as the mean and standard deviation across repetitions. During evaluation, the learned policy was executed in greedy mode: deterministic actions were taken without exploration noise to isolate the policy's performance. Key metrics include the average episodic reward, success rate, and variance across repetitions, which are used to assess performance and convergence behavior. The average episodic reward is computed over N = 50 rollouts and S = 3 repetitions according to Equation (11):
R̄ = (1/(SN)) Σ_{s=1}^{S} Σ_{i=1}^{N} R_{i,s}
where R_{i,s} is the total reward of rollout i in experimental run s. The success rate is the fraction of rollouts in which the agent successfully reaches the goal. Variability is quantified through the standard deviation of the learning curves. This protocol is consistent with established practices in DRL navigation benchmarks, facilitating a fair comparison with previous work [8,9].

4. Experiments and Results

4.1. Simulation Environment

All experiments were conducted in a simulation framework using ROS (Noetic) with Gazebo Classic (version 11), which provides physics-accurate interactions and reproducible navigation tasks. Training and inference were performed on a workstation equipped with an Intel Core i9-14900KF CPU, an NVIDIA RTX 5000 Ada Generation GPU (32 GB VRAM), and 64 GB of RAM. The platform is the AgileX ScoutMini, a differential-drive wheeled robot selected for its widespread use in research on indoor navigation. The robot is actuated through continuous control of the linear velocity v ∈ [0, 1.0] m/s and angular velocity ω ∈ [−2.0, 2.0] rad/s.
The sensing suite consists of a forward-facing RGB camera paired with a pixel-aligned depth sensor, simulating commercial RGB–D devices such as Intel RealSense or ZED stereo cameras. Odometry provides the robot pose relative to the navigation goal, expressed in polar coordinates r , θ . These observations were designed to directly mirror real-world sensor setups, ensuring that the simulation findings remain transferable in principle. Our simulation environment and the unmanned ground vehicle (UGV) model are depicted in Figure 2.
Each navigation episode began by placing the robot at a random position and sampling a random goal location within the environment. The task was considered successful if the robot reached its goal within a specified tolerance radius. Episodes terminated upon reaching the goal, colliding with an obstacle, or exceeding the maximum limit of 300 steps.
To ensure a robust and fair evaluation, all experiments were conducted in procedurally generated environments with randomized layouts containing static obstacles such as blocks, walls, and barriers. Both training and evaluation layouts were drawn from the same distribution, ensuring that all modality and fusion strategies were compared under identical task conditions, thereby facilitating a direct and principled comparison with the baseline framework by Huang et al. [10]. The key parameters of the training and evaluation environment are summarized in Table 2.

4.2. Training Protocol

Each agent was trained for 500 episodes, with each episode capped at 300 simulation steps. At each step, the agent selected a continuous action (v, ω), which was executed in Gazebo. The resulting next state, reward, and termination flag were stored in the replay buffer for subsequent updates. The replay buffer had a fixed capacity of 20,000 transitions and supported prioritized experience replay [35].
Training followed the Soft Actor–Critic framework [29]. Both the actor and critic were updated off-policy from replay samples. Target networks were updated with a smoothing coefficient τ = 0.005, the discount factor was set to γ = 0.999, and the learning rates of the actor and critic were both fixed at 1 × 10⁻³. The entropy coefficient α was tuned automatically to balance exploration and exploitation.
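The target-network update is standard Polyak averaging with τ = 0.005; a minimal sketch over flat parameter lists (real implementations update network tensors in place):

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

# After each gradient step, the targets drift slowly toward the online networks.
targets = soft_update([0.0, 1.0], [1.0, 1.0])
```

With τ = 0.005, the targets track the online networks with a time constant of roughly 200 updates, which stabilizes the critic's bootstrap targets.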

4.3. Results and Discussion

Based on the training protocol, we evaluated the learning performance of the proposed RGB-D (early fusion) method against the canonical 4RGB baseline, focusing on convergence speed, training stability, and overall navigation performance under identical experimental conditions. Figure 3 illustrates the moving average (window size k = 20) of episodic rewards over 500 training episodes for both methods. The RGB-D agent not only converges faster but also achieves higher final performance and exhibits lower variance across three independent runs. A direct comparison between Huang's 4RGB baseline and our proposed RGB-D formulation demonstrates the clear benefit of incorporating explicit geometry. Evaluated on the key metrics of success rate and average reward, the RGB-D modality reaches 40.0% ± 4.2% success and a +94.1 ± 15.8 reward, significantly surpassing the 4RGB baseline's 28.0% ± 3.5% success and +35.2 ± 10.1 reward. This corresponds to a 12-percentage-point increase in the success rate and a nearly threefold improvement in the reward. The learning curves in Figure 3 further reveal that RGB-D converges more rapidly, attains higher final performance, and exhibits lower variance across experimental runs. These comparative results are summarized in Table 3. In addition to its superior performance, the RGB-D modality offers a significant computational advantage. While the core Transformer architecture was held constant for a fair comparison, the initial patch embedding layer differs substantially. The 4RGB baseline flattens each patch into a 3840-dimensional vector (16 × 20 × 12), whereas our RGB-D modality's patches are flattened into a 1280-dimensional vector (16 × 20 × 4). As both are projected to the same 256-dimensional embedding space, the 4RGB model's embedding layer contains 983,296 parameters (3840 × 256 weights + 256 biases).
In contrast, our RGB-D model's layer contains only 327,936 parameters (1280 × 256 weights + 256 biases). This 66.7% reduction in parameters for the input-processing stage directly implies a lower computational load and faster inference, highlighting another key practical advantage of our approach over temporal stacking.
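The parameter counts above follow directly from the linear patch-embedding dimensions and can be verified with a quick check (helper name is ours):

```python
def embed_params(patch_dim, d_model=256):
    """Parameter count of a linear patch-embedding layer: weights plus biases."""
    return patch_dim * d_model + d_model

p_4rgb = embed_params(16 * 20 * 12)  # four stacked RGB frames -> 983,296
p_rgbd = embed_params(16 * 20 * 4)   # single RGB-D frame -> 327,936
reduction = 1.0 - p_rgbd / p_4rgb    # roughly a two-thirds reduction
```

Since the weight matrices dominate the bias terms, the ratio is essentially the ratio of patch dimensions (1280/3840 = 1/3).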
To systematically dissect the factors behind RGB-D’s superior performance and to clarify the distinct roles of appearance, geometry, and temporal redundancy, we conducted a comprehensive ablation study. This study evaluated five modality variants: the proposed RGB-D, the 4RGB baseline, Gray and Depth (G-D) to isolate geometry, Four Stacked Gray and Depth (4G-D) to test temporal compensation, and Four Stacked RGB and Depth (4RGB-D) to assess the impact of high-dimensional temporal stacking.
Figure 4 presents the training reward curves for all five modality variants, placing the direct comparison from Figure 3 (our proposed RGB-D vs. the 4RGB baseline) into the broader context of the full ablation study. These curves are smoothed with a moving average ( k = 20 ) to visualize learning stability and convergence over 500 episodes. The plot provides clear visual evidence of the RGB-D agent’s robust learning trajectory, which consistently achieves the highest rewards. In stark contrast, the erratic and low-reward curves of the G-D and 4RGB-D variants illustrate the instability caused by insufficient semantic cues and excessive input dimensionality, respectively. Even with the addition of temporal cues, the 4G-D variant fails to match the performance of color-enabled modalities, highlighting that motion information cannot fully substitute for rich appearance features.
Figure 5 provides a visual summary of the final navigation performance by comparing the success rate (bars) and the average reward (line) across the five tested modality variants: RGB-D, 4RGB, Gray and Depth (G-D), Four Stacked Gray and Depth (4G-D), and Four Stacked RGB and Depth (4RGB-D). The figure confirms that RGB-D achieved the highest performance, clearly surpassing the 4RGB baseline. Geometry-only inputs (G-D) and stacked temporal inputs (4G-D and 4RGB-D) are shown to underperform, establishing evidence that appearance and geometry are complementary and that temporal stacking alone is inadequate.
The ablation experiments, with results presented in Figure 4 and Figure 5 and summarized in Table 4, further clarify the roles of appearance, geometry, and temporal redundancy, confirming that appearance and geometry are complementary and essential. Geometry alone, as tested in the gray-and-depth (G-D) variant, proved insufficient: by removing color semantics such as wall–floor contrast and object boundaries, the policy struggled to navigate robustly, achieving only 14.0% success and an average reward of −1.25 with unstable convergence. We then investigated whether temporal redundancy could compensate. Stacking four G-D frames (4G-D) did improve stability, increasing the success rate to 26.0% and the reward to +25.4, confirming that motion cues from temporal stacking provide useful, stabilizing information. Even with this improvement, however, 4G-D failed to approach the performance of the color-enabled variants, demonstrating that temporal cues cannot substitute for the rich semantic information provided by color. Finally, we investigated whether adding temporal data to our best modality would yield further gains. While the four-stacked RGB-D (4RGB-D) variant combines appearance, geometry, and temporal cues, the resulting 16-channel input created a dimensional explosion: the patch embeddings grew to a size of 5120 (16 × 20 × 16), which destabilized the training process. Beyond this dimensional challenge, other factors likely contributed to the collapse. Our lightweight Transformer (2 blocks, 4 heads) was held constant across all experiments for a fair comparison and likely lacked the parameter capacity to process such a high-dimensional input vector effectively. Furthermore, this large input made the optimization problem significantly harder; the fixed SAC hyperparameters (such as the learning rate) used across all experiments were likely ill-suited to this more complex task, leading to the observed training instability.
Consequently, the performance degraded markedly to just 16.0% success and a −6.7 reward. This outcome demonstrates that brute-force temporal stacking is computationally inefficient and ultimately less effective than a principled, single-frame fusion of complementary modalities.
Next, we examined how the stage of sensor fusion impacts performance. Figure 6 displays the training reward curves, calculated using a moving average (k = 20), for the three investigated fusion strategies: Early Fusion (Variant A), Parallel Encoding (Variant B), and Late Fusion (Variant C). The visualization highlights the critical role of fusion stage design, demonstrating that Early Fusion (Variant A) consistently yields the highest and most stable rewards throughout the 500 training episodes. In sharp contrast, Parallel Encoding (Variant B) exhibits an early and catastrophic performance collapse. The Late Fusion (Variant C) strategy shows a smoother but significantly weaker trajectory, converging well below Variant A, though above the level to which Parallel Encoding collapsed.
Figure 7 presents a visual comparison of the final policy performance for the fusion strategies—Early Fusion (Variant A), Parallel Encoding (Variant B), and Late Fusion (Variant C)—by showing the success rate and average reward metrics. The figure clearly illustrates that Early Fusion (Variant A) significantly outperforms both the parallel and late fusion designs. This performance differential confirms that the pixel-level integration of RGB and depth provides the most effective cross-modal representation for the Transformer.
Table 5 quantifies the final performance metrics for the fusion strategies: Variant A (Early Fusion), Variant B (Parallel Fusion), and Variant C (Late Fusion). Early Fusion (Variant A) is numerically confirmed as superior, achieving 40.0% ± 4.2% success and a +94.1 ± 15.8 average reward. Parallel Fusion (Variant B) yielded the lowest performance, with 6.0% ± 1.5% success and a −10.4 ± 4.8 reward. Late Fusion (Variant C) achieved an intermediate but poor result of 16.0% ± 2.8% success and a −7.2 ± 3.1 reward. The table confirms the consistent underperformance of both parallel and late designs relative to the early fusion strategy.
Early fusion (Variant A), which concatenates RGB and depth at the pixel level before the patch tokenization, proved to be the most effective strategy, achieving 40.0% success and a +94.1 reward. In contrast, parallel encoding (Variant B), which doubles the token sequence length to 129, severely burdened the lightweight transformer. It is important to note that this performance collapse may reflect the fixed capacity of our encoder being insufficient for the longer sequence, rather than an inherent flaw in the parallel fusion concept itself under a different computational budget. This architectural mismatch led to unstable learning and a performance collapse to just 6.0% success and a −10.4 reward. Finally, late fusion (Variant C), which compresses depth into a scalar statistic per patch, provided smoother but weaker training. By discarding fine-grained geometric details, it converged to only 16.0% success and a −7.2 reward. These results provide a clear conclusion: for self-attention to be effective, cross-modal correlations between appearance and geometry must be available at the pixel or patch level. The learning dynamics of these strategies are shown in the training reward curves in Figure 6, with a summary of their final performance metrics presented in Figure 7 and Table 5.
The final stage of our systematic investigation explored performance optimization through hyperparameter tuning. Figure 8 illustrates the training reward curves, calculated using a moving average (k = 20), comparing Huang’s 4RGB baseline, the standard RGB-D (early fusion) variant, and the tuned RGB-D configuration. The visual data confirms that optimization—specifically halving the actor and critic learning rates—successfully mitigated late-episode instability in critic losses, resulting in improved training stability and convergence. The tuned model clearly achieves the highest overall reward, further amplifying the performance margin of RGB-D over the 4RGB baseline.
Figure 9 provides a final performance comparison across Huang’s 4RGB baseline, the original RGB-D policy, and the tuned RGB-D configuration. The figure distinctly shows that the tuned configuration achieves the highest success rate and average reward. This result underscores the inherent superiority of the RGB-D modality over 4RGB and highlights the potential for achieving substantial further gains through modest hyperparameter tuning.
Table 6 provides the definitive numerical quantification of the performance gains achieved following hyperparameter optimization. The table confirms that the RGB-D (Tuned) configuration secured the highest results across all experiments, achieving a 54.0% ± 3.8% success rate and a +146.8 ± 20.5 average reward. This optimized result represents a significant gain over the original RGB-D performance (40.0% ± 4.2% success, +94.1 ± 15.8 reward) and establishes a substantial lead over the 4RGB baseline (28.0% ± 3.5% success, +35.2 ± 10.1 reward).
By halving the actor and critic learning rates from 0.001 to 0.0005, we mitigated the late-episode instability in critic losses and improved convergence. This tuned RGB-D configuration achieves 54.0% success and +146.8 reward, outperforming both the original RGB-D and 4RGB baseline by a substantial margin. These findings demonstrate that RGB-D surpasses 4RGB under aligned conditions and allows further performance gains with modest tuning. Figure 8 illustrates the improved training stability of the tuned model, and Figure 9 compares its final performance against the baselines. The significant gains are also quantified in Table 6.
Together, these results establish RGB-D early fusion as a principled and superior successor to 4RGB. Ablation studies confirm that appearance and geometry are complementary, temporal redundancy is insufficient, and early fusion is the most effective integration strategy. Moreover, the tuned RGB-D configuration demonstrated that the framework can be further stabilized and improved, providing a strong foundation for future sim-to-real transfers.

5. Conclusions

Our systematic evaluation revealed that modality and fusion design are decisive factors in Transformer-enabled DRL navigation. The most significant finding is that an RGB-D early fusion modality consistently and significantly outperforms the canonical 4RGB temporal stacking baseline under identical configurations. Whereas the 4RGB baseline by [10] encodes only pixel intensity changes over time, RGB-D integrates appearance and geometry in a spatially aligned manner. This yields tokens that are inherently more informative, allowing the prepended goal token to attend simultaneously to semantic textures (e.g., walls and corridors) and geometric cues (e.g., obstacle proximity). In this sense, RGB-D replaces temporal redundancy with valuable spatial complementarity. From an attention perspective, this cross-modal alignment increases the mutual information between the goal and visual tokens, improving credit assignment during SAC updates.
The ablation results emphasize this complementary nature of color and depth. Depth alone (G-D) provides geometric safety margins but fails to resolve semantic ambiguities, while color alone lacks the explicit distance information needed to avoid risky trajectories. The results also clarify the limitations of temporal stacking; while short-term motion cues can offer some stability, they cannot compensate for a lack of rich sensory input and can lead to a dimensional explosion that destabilizes training. Furthermore, our analysis of fusion strategies confirms that early, patch-level fusion is the most effective approach, as it allows the self-attention mechanism to correlate appearance and geometry from the earliest stage. In contrast, parallel encoding overburdened the lightweight transformer, and late fusion discarded critical, fine-grained geometric detail.
Our primary contributions are therefore threefold. First, we demonstrate that replacing temporal stacking (4RGB) with a multimodal, single-frame input (RGB-D) yields substantial performance gains. Specifically, our proposed RGB-D method achieved a 40.0% success rate and a +94.1 average reward, marking a 12-percentage-point increase in success and a nearly threefold increase in reward over the 4RGB baseline's 28.0% success and +35.2 reward. Second, we provide the first systematic ablation study isolating the effects of color, depth, and temporal stacking in this architecture, establishing clear evidence that appearance and geometry are complementary and that early fusion is the optimal integration strategy. Finally, we show that with modest hyperparameter tuning, the RGB-D agent's performance can be further improved to a 54.0% success rate and +146.8 reward, confirming its robustness and potential as a superior foundation for DRL-based navigation.
While these findings provide strong evidence for RGB-D fusion, it is important to acknowledge the limitations of this study. Our experiments were conducted within a single, procedurally generated simulation environment composed of static obstacles. The policy’s robustness to more diverse architectural layouts or the introduction of dynamic elements remains an open question. Furthermore, although we incorporated sensor noise and clipping to promote sim-to-real transferability, the complexities of physical hardware—including calibration drift and challenging lighting conditions—are not fully captured here.
Future work should extend these findings in three key directions. First, the most crucial and immediate direction is the validation of this RGB-D policy on a physical platform. This priority is underscored by real-world challenges such as lighting-induced ground reflections that distort fisheye RGB images and dynamic obstacles with ground-similar colors, both of which have been shown to degrade navigation performance in prior physical evaluations of Transformer-based DRL frameworks [10]. This is the clear and necessary next step to confirm the sim-to-real transfer benefits and verify the policy’s effectiveness on hardware, such as the AgileX ScoutMini, ideally combined with domain randomization and enhanced sensor-noise modeling. Second, the policy’s generalization capabilities should be evaluated in more complex and realistic settings with dynamic obstacles, complex lighting variations, and more unstructured environments, as well as layouts drawn from different distributions to fully assess its robustness. Third, to better understand the policy’s internal decision-making process, future research should focus on model interpretability. This would include adding qualitative trajectory visualizations to compare the paths of different policies, as well as visualizing the self-attention maps to analyze how the goal token attends to specific RGB and depth features, both of which would provide valuable insight into the learned behavior. Finally, to address the challenge of temporal reasoning more effectively than brute-force stacking, future research should investigate more principled architectures, such as recurrent attention or memory-augmented transformers.
By clarifying the role of modality and fusion, this study provides both theoretical insight and practical guidance for designing the next generation of robust, geometry-aware navigation policies for autonomous mobile robots.

Author Contributions

Conceptualization, A.B.İ. and M.E.C.; methodology, A.B.İ.; software, M.E.C.; validation, A.B.İ. and M.E.C.; formal analysis, A.B.İ.; investigation, A.B.İ.; resources, M.E.C.; data curation, M.E.C.; writing—original draft preparation, M.E.C.; writing—review and editing, A.B.İ. and M.E.C.; visualization, M.E.C.; supervision, A.B.İ.; project administration, A.B.İ. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and data presented in this study are available on request from the corresponding author for the purposes of peer review. The source code will be made publicly available in a GitHub repository upon publication of this article.

Acknowledgments

This work has been supported by Kocaeli University Scientific Research Projects Coordination Unit under grant number 2026-3409 and Artificial Intelligence and Simulation Systems R&D (https://yapbenzet.kocaeli.edu.tr/, accessed on 11 December 2025) Laboratory, whose academic environment and guidance greatly contributed to the success of this work. The authors gratefully acknowledge the use of the laboratory’s computing facilities and TRUBA high-performance computing resources for conducting the experiments. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Thrun, S. Robotic Mapping: A Survey. In Exploring Artificial Intelligence in the New Millennium; Morgan Kaufmann: San Francisco, CA, USA, 2003; pp. 1–35. [Google Scholar]
  2. Durrant-Whyte, H.; Bailey, T. Simultaneous Localization and Mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110. [Google Scholar] [CrossRef]
  3. Dolgov, D.; Thrun, S.; Montemerlo, M.; Diebel, J. Path Planning for Autonomous Vehicles in Unknown Semi-Structured Environments. Int. J. Robot. Res. 2010, 29, 485–501. [Google Scholar] [CrossRef]
  4. Karaman, S.; Frazzoli, E. Sampling-Based Algorithms for Optimal Motion Planning. Int. J. Robot. Res. 2011, 30, 846–894. [Google Scholar] [CrossRef]
  5. Kavraki, L.E.; Svestka, P.; Latombe, J.-C.; Overmars, M.H. Probabilistic Roadmaps for Path Planning in High-Dimensional Configuration Spaces. IEEE Trans. Robot. Autom. 1996, 12, 566–580. [Google Scholar] [CrossRef]
  6. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. Autom. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  7. Kuutti, S.; Bowden, R.; Jin, Y.; Barber, P.; Fallah, S. A Survey of Deep Learning Applications to Autonomous Vehicle Control. IEEE Trans. Intell. Transp. Syst. 2021, 22, 712–733. [Google Scholar] [CrossRef]
  8. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-Driven Visual Navigation in Indoor Scenes Using Deep Reinforcement Learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3357–3364. [Google Scholar] [CrossRef]
  9. Zhu, K.; Zhang, T. Deep Reinforcement Learning Based Mobile Robot Navigation: A Review. Tsinghua Sci. Technol. 2021, 26, 674–691. [Google Scholar] [CrossRef]
  10. Huang, W.; Zhou, Y.; He, X.; Lv, C. Goal-Guided Transformer-Enabled Reinforcement Learning for Efficient Autonomous Navigation. IEEE Trans. Intell. Transp. Syst. 2024, 25, 1832–1845. [Google Scholar] [CrossRef]
  11. Prakash, A.; Chitta, K.; Geiger, A. Multi-Modal Fusion Transformer for End-to-End Autonomous Driving. arXiv 2021, arXiv:2202.11101. [Google Scholar]
  12. Wang, H.; Tan, A.H.; Nejat, G. NavFormer: A Transformer Architecture for Robot Target-Driven Navigation in Unknown and Dynamic Environments. IEEE Robot. Autom. Lett. 2024, 9, 6808–6815. [Google Scholar] [CrossRef]
  13. Jang, Y.; Baek, J.; Jeon, S.; Han, S. Bridging the Simulation-to-Real Gap of Depth Images for Deep Reinforcement Learning. Expert Syst. Appl. 2024, 253, 124310. [Google Scholar] [CrossRef]
  14. Wang, P.; Li, W.; Ogunbona, P.; Wan, J.; Escalera, S. RGB-D-Based Human Motion Recognition with Deep Learning: A Survey. Comput. Vis. Image Underst. 2018, 171, 118–139. [Google Scholar] [CrossRef]
  15. Lu, Y.; Song, D. Robust RGB-D Odometry Using Point and Line Features. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3934–3942. [Google Scholar] [CrossRef]
  16. Li, Y.; Lyu, Q.; Yang, J.; Salam, Y.; Wang, B. Visual Target-Driven Robot Crowd Navigation with Limited FOV Using Self-Attention Enhanced Deep Reinforcement Learning. Sensors 2025, 25, 639. [Google Scholar] [CrossRef] [PubMed]
  17. Gao, Y.; Wu, J.; Wei, C.; Grech, R.; Ji, Z. Deep Reinforcement Learning for Localisability-Aware Mapless Navigation. IET Cyber-Syst. Robot. 2025, 7, e70018. [Google Scholar] [CrossRef]
  18. Ugurlu, H.I.; Pham, X.H.; Kayacan, E. Sim-to-Real Deep Reinforcement Learning for Safe End-to-End Planning of Aerial Robots. Robotics 2022, 11, 109. [Google Scholar] [CrossRef]
  19. Baltrusaitis, T.; Ahuja, C.; Morency, L.P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef]
  20. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6639–6648. [Google Scholar] [CrossRef]
  21. Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. FUTR3D: A Unified Sensor Fusion Framework for 3D Detection. arXiv 2022, arXiv:2203.10642. [Google Scholar]
  22. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. arXiv 2021, arXiv:2106.01345. [Google Scholar] [CrossRef]
  23. Zeng, K.-H.; Zhang, Z.; Ehsani, K.; Hendrix, R.; Salvador, J.; Herrasti, A.; Girshick, R.; Kembhavi, A.; Weihs, L. PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators. arXiv 2024, arXiv:2406.00936. [Google Scholar]
  24. Jin, S.; Wang, X.; Meng, Q. Spatial Memory-Augmented Visual Navigation Based on Hierarchical Deep Reinforcement Learning in Unknown Environments. Knowl.-Based Syst. 2024, 285, 111358. [Google Scholar] [CrossRef]
  25. Wang, J.; Yu, Z.; Zhou, D.; Shi, J.; Deng, R. Vision-Based Deep Reinforcement Learning of Unmanned Aerial Vehicle (UAV) Autonomous Navigation Using Privileged Information. Drones 2024, 8, 782. [Google Scholar] [CrossRef]
  26. Ge, L.; Zhou, X.; Li, Y.; Wang, Y. Deep Reinforcement Learning Navigation via Decision Transformer in Autonomous Driving. Front. Neurorobot. 2024, 18. [Google Scholar] [CrossRef]
  27. Tan, S.; Zhou, D.; Shao, X.; Wang, J.; Sun, G. Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models. arXiv 2025, arXiv:2501.00421. [Google Scholar]
  28. Wu, K.; Han, W.; Abolfazli Esfahani, M.; Yuan, S. Learn to Navigate Autonomously Through Deep Reinforcement Learning. IEEE Trans. Ind. Electron. 2022, 69, 5342–5352. [Google Scholar] [CrossRef]
  29. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
  30. Kahn, G.; Abbeel, P.; Levine, S. LaND: Learning to Navigate from Disengagements. IEEE Robot. Autom. Lett. 2021, 6, 1872–1879. [Google Scholar] [CrossRef]
  31. Kargar, E.; Kyrki, V. Vision Transformer for Learning Driving Policies in Complex and Dynamic Environments. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; pp. 1558–1564. [Google Scholar] [CrossRef]
  32. Hansen, N.; Su, H.; Wang, X. Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021. [Google Scholar]
  33. Xiao, Y.; Codevilla, F.; Gurram, A.; Urfalioglu, O.; Lopez, A.M. Multimodal End-to-End Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 537–547. [Google Scholar] [CrossRef]
  34. Codevilla, F.; Muller, M.; Lopez, A.; Koltun, V.; Dosovitskiy, A. End-to-End Driving Via Conditional Imitation Learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 4693–4700. [Google Scholar] [CrossRef]
  35. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2016. [Google Scholar] [CrossRef]
Figure 1. The proposed RGB-D GoT-SAC pipeline. An RGB-D frame is divided into patch tokens, which are prepended with a goal token representing the Goal Information. This sequence is processed by a Goal-guided Transformer Encoder to extract goal-relevant features for the decision-making module.
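As a concrete reading of the Figure 1 front end, the sketch below fuses a pixel-aligned RGB-D frame at the channel level, splits it into patch tokens using the 16 × 20 patch size from Table 1, and prepends a goal token. The 128 × 160 input resolution, the zero-padded goal embedding, and the 2-D goal vector are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

PATCH_H, PATCH_W = 16, 20  # patch size from Table 1

def rgbd_to_tokens(rgb, depth, goal_xy):
    """Early-fuse RGB (H, W, 3) and depth (H, W) at the pixel level, then tokenize."""
    rgbd = np.concatenate([rgb, depth[..., None]], axis=-1)    # (H, W, 4) early fusion
    H, W, C = rgbd.shape
    nh, nw = H // PATCH_H, W // PATCH_W
    # Split into non-overlapping patches and flatten each into one token.
    patches = (rgbd.reshape(nh, PATCH_H, nw, PATCH_W, C)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(nh * nw, PATCH_H * PATCH_W * C))   # (N, patch_dim)
    # Embed the goal into the same token width (zero-padded here for illustration)
    # and prepend it, so attention can condition every patch on the goal.
    goal_token = np.zeros((1, patches.shape[1]))
    goal_token[0, :goal_xy.size] = goal_xy
    return np.concatenate([goal_token, patches], axis=0)       # (N + 1, patch_dim)

rgb = np.random.rand(128, 160, 3)
depth = np.random.rand(128, 160)
tokens = rgbd_to_tokens(rgb, depth, np.array([4.2, 0.3]))
print(tokens.shape)  # (65, 1280): 8 × 8 = 64 patch tokens plus 1 goal token
```

In a full implementation each flattened patch would pass through a learned linear projection before entering the Transformer encoder; the token layout, however, matches the caption: one goal token followed by the RGB-D patch sequence.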
Figure 2. Simulation setup in ROS/Gazebo [10]. (a) Static indoor environment layout used for training and evaluation [10]. (b) An AgileX ScoutMini unmanned ground vehicle (UGV) equipped with RGB-D sensing for navigation tasks.
Figure 3. Training reward curves comparing 4RGB and RGB-D (Early Fusion) over 500 episodes. RGB-D shows faster convergence and higher stability across three runs.
Figure 4. Training reward curves (moving average, k = 20) across modality variants. RGB-D (early fusion) achieves the highest and most stable rewards, outperforming the 4RGB baseline, whereas G-D, 4G-D, and 4RGB-D lag behind, confirming that geometry alone or brute-force stacking is insufficient compared to RGB-D.
Figure 5. Comparison of success rate (bars) and average reward (line) across modality variants. RGB-D achieves the highest performance, surpassing the 4RGB baseline, whereas geometry-only (G-D) and stacked inputs (4G-D, 4RGB-D) underperform, confirming that appearance and geometry are complementary and that temporal stacking alone is insufficient.
Figure 6. Training reward curves (moving average, k = 20) for RGB-D fusion strategies. Early fusion (Variant A) yields the highest and most stable rewards, while parallel (Variant B) collapses and late fusion (Variant C) underperforms, highlighting the critical role of fusion stage design.
Figure 7. Comparison of fusion strategies (success rate and average reward). Early fusion (Variant A) significantly outperforms both parallel (Variant B) and late (Variant C) fusion, confirming that pixel-level integration of RGB and depth provides the most effective representation.
Figure 8. Training reward curves (moving average, k = 20) comparing Huang’s 4RGB baseline, RGB-D (early fusion), and tuned RGB-D. Tuning further stabilizes training and widens RGB-D’s margin over the 4RGB baseline, achieving the highest overall reward.
Figure 9. Performance comparison of Huang’s 4RGB baseline, RGB-D, and tuned RGB-D. The tuned configuration achieves the highest success rate and average reward, highlighting both the superiority of RGB-D over 4RGB and the potential for further gains with modest tuning.
Table 1. Unified key hyperparameter configuration. The listed parameters for the training algorithm, network architecture, and environment were applied consistently across all modality and fusion experiments detailed in this study.
Parameter                          Value
Training
Actor Learning Rate                0.001
Critic Learning Rate               0.001
Discount Factor (γ)                0.999
Replay Buffer Size                 20,000
Target Update Coefficient (τ)      0.005
Entropy Coefficient (α)            Automatically tuned
Gradient Clipping                  1.0
Architecture
Transformer Blocks                 2
Attention Heads                    4
Patch Size                         16 × 20
Environment
Max Linear Velocity (ν_max)        1.0 m/s
Max Angular Velocity (ω_max)       2.0 rad/s
Reward (Success)                   +100
Reward (Collision)                 −100
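For readers reimplementing the setup, Table 1 can be collapsed into a single configuration object. The sketch below does so (the key names are ours, only the values come from the table) and includes the Polyak target-network update governed by τ, as used in SAC:

```python
import numpy as np

# Table 1 as a configuration dictionary; values from the paper, keys illustrative.
CONFIG = {
    "actor_lr": 1e-3,
    "critic_lr": 1e-3,
    "gamma": 0.999,                # discount factor
    "replay_buffer_size": 20_000,
    "tau": 0.005,                  # target update coefficient
    "alpha": "auto",               # entropy coefficient, tuned automatically
    "grad_clip": 1.0,
    "transformer_blocks": 2,
    "attention_heads": 4,
    "patch_size": (16, 20),
    "v_max": 1.0,                  # m/s
    "w_max": 2.0,                  # rad/s
    "reward_success": +100.0,
    "reward_collision": -100.0,
}

def soft_update(target, source, tau=CONFIG["tau"]):
    """Polyak averaging of target-network weights: θ' ← (1 − τ)θ' + τθ."""
    return (1.0 - tau) * target + tau * source
```

With τ = 0.005 the target network tracks the online critic slowly, which is the standard stabilizing choice in off-policy actor–critic training.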
Table 2. Simulation Environment Configuration.
Parameter                          Value
Environment Layout                 1 static indoor map
Environment Size                   14 m × 8 m
Obstacle Density                   ~24% (empirical)
Start/Goal Positions               Procedurally generated
Minimum Start–Goal Distance        6.0 m
Goal Tolerance Radius              0.5 m
Maximum Episode Steps              300
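Tables 1 and 2 together imply a simple episode-termination rule: ±100 terminal rewards, a 0.5 m goal tolerance, and a 300-step cap. The sketch below makes that logic explicit; the dense per-step shaping term is an assumption for illustration, since the paper’s exact shaping reward is not reproduced here.

```python
import math

GOAL_TOL = 0.5     # goal tolerance radius (m), Table 2
MAX_STEPS = 300    # maximum episode steps, Table 2

def step_outcome(robot_xy, goal_xy, collided, step):
    """Return (reward, done) for one environment step."""
    dist = math.hypot(robot_xy[0] - goal_xy[0], robot_xy[1] - goal_xy[1])
    if collided:
        return -100.0, True           # collision terminates the episode (Table 1)
    if dist < GOAL_TOL:
        return +100.0, True           # goal reached within tolerance
    if step >= MAX_STEPS:
        return 0.0, True              # timeout: truncate with no terminal bonus
    return -0.01 * dist, False        # assumed dense shaping toward the goal
```

This structure (sparse terminal signals plus mild shaping) is a common pattern in mapless-navigation DRL and matches the success/collision rewards listed in Table 1.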
Table 3. Evaluation of the baseline (4RGB) versus the proposed RGB-D under identical training conditions.
Method     Success Rate (%)    Avg. Reward
RGB-D      40.0 ± 4.2          +94.1 ± 15.8
4RGB       28.0 ± 3.5          +35.2 ± 10.1
Table 4. Evaluation of proposed RGB-D input modality against 4RGB, 4G-D, 4RGB-D, and G-D under identical training conditions.
Method     Success Rate (%)    Avg. Reward
RGB-D      40.0 ± 4.2          +94.1 ± 15.8
4RGB       28.0 ± 3.5          +35.2 ± 10.1
4G-D       26.0 ± 3.1          +25.4 ± 8.2
4RGB-D     16.0 ± 4.0          −6.7 ± 6.5
G-D        14.0 ± 2.5          −1.3 ± 5.0
Table 5. Comparison of RGB-D fusion strategies. Early fusion consistently outperforms parallel and late designs.
Fusion Strategy                Success Rate (%)    Avg. Reward
Variant A (Early Fusion)       40.0 ± 4.2          +94.1 ± 15.8
Variant B (Parallel Fusion)    6.0 ± 1.5           −10.4 ± 4.8
Variant C (Late Fusion)        16.0 ± 2.8          −7.2 ± 3.1
Table 6. Tuned RGB-D achieves the highest performance under modest hyperparameter tuning.
Method           Success Rate (%)    Avg. Reward
RGB-D (Tuned)    54.0 ± 3.8          +146.8 ± 20.5
RGB-D            40.0 ± 4.2          +94.1 ± 15.8
4RGB             28.0 ± 3.5          +35.2 ± 10.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
