4.1. Simulation Environment
All experiments were conducted in a simulation framework using ROS Noetic with Gazebo Classic (version 11), which provides physics-accurate interactions and reproducible navigation tasks. Training and inference were performed on a workstation equipped with an Intel Core i9-14900KF CPU, an NVIDIA RTX 5000 Ada Generation GPU (32 GB VRAM), and 64 GB of RAM. The platform is the AgileX Scout Mini, a differential-drive wheeled robot selected for its widespread use in indoor-navigation research. The robot is actuated through continuous control of its linear and angular velocities.
The sensing suite consists of a forward-facing RGB camera paired with a pixel-aligned depth sensor, simulating commercial RGB-D devices such as the Intel RealSense or ZED stereo cameras. Odometry provides the robot pose relative to the navigation goal, expressed in polar coordinates (distance and heading angle to the goal). These observations were designed to mirror real-world sensor setups directly, ensuring that the simulation findings remain transferable in principle. Our simulation environment and the unmanned ground vehicle (UGV) model are depicted in
Figure 2.
Each navigation episode began by placing the robot at a random position and sampling a random goal location within the environment. The task was considered successful if the robot reached its goal within a specified tolerance radius. Episodes terminated upon reaching the goal, colliding with an obstacle, or exceeding the maximum limit of 300 steps.
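The termination logic above can be sketched in a few lines. This is an illustrative sketch only: the function name, the coordinate representation, and the tolerance value are assumptions; only the 300-step limit and the three termination conditions come from the text.

```python
import math

GOAL_TOLERANCE = 0.5   # tolerance radius in metres (assumed value)
MAX_STEPS = 300        # maximum step limit from the protocol

def episode_status(robot_xy, goal_xy, collided, step):
    """Classify the episode outcome after each simulation step.

    Returns "success", "collision", "timeout", or "running".
    """
    dist = math.hypot(robot_xy[0] - goal_xy[0], robot_xy[1] - goal_xy[1])
    if dist <= GOAL_TOLERANCE:
        return "success"          # goal reached within tolerance radius
    if collided:
        return "collision"        # contact with a static obstacle
    if step >= MAX_STEPS:
        return "timeout"          # step budget exhausted
    return "running"
```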
To ensure a robust and fair evaluation, all experiments were conducted in procedurally generated environments with randomized layouts containing static obstacles such as blocks, walls, and barriers. Both training and evaluation layouts were drawn from the same distribution, ensuring that all modality and fusion strategies were compared under identical task conditions, thereby facilitating a direct and principled comparison with the baseline framework by Huang et al. [
10]. The key parameters of the training and evaluation environment are summarized in
Table 2.
4.3. Results and Discussion
Based on the training protocol, we evaluated the learning performance of the proposed RGB-D (Early Fusion) method in comparison with the canonical 4RGB baseline. The evaluation focused on convergence speed, training stability, and overall navigation performance under identical experimental conditions.
Figure 3 illustrates the moving average (window size k = 20) of episodic rewards over 500 training episodes for both the 4RGB baseline and the proposed RGB-D (Early Fusion) method. The RGB-D agent not only converges faster but also achieves higher final performance and exhibits lower variance across three independent runs. A direct comparison between Huang’s 4RGB baseline and our proposed RGB-D formulation demonstrates the clear benefit of incorporating explicit geometry. Evaluated on the key metrics of success rate and average reward, the RGB-D modality reaches 40.0% ± 4.2% success and a +94.1 ± 15.8 reward, significantly surpassing the 4RGB baseline at 28.0% ± 3.5% success and a +35.2 ± 10.1 reward. This corresponds to a 12-percentage-point increase in success rate and a nearly threefold improvement in reward. The learning curves, presented in
Figure 3, further reveal that RGB-D converges more rapidly, attains higher final performance, and exhibits lower variance across experimental runs. These comparative results are summarized in
Table 3. In addition to its superior performance, the RGB-D modality offers a significant computational advantage. While the core Transformer architecture was held constant for a fair comparison, the initial patch-embedding layer differs substantially. The 4RGB baseline flattens each patch into a 3840-dimensional vector (12 input channels per patch), whereas our RGB-D modality flattens each patch into a 1280-dimensional vector (4 input channels per patch). As both are projected to the same 256-dimensional embedding space, the 4RGB model’s embedding layer contains 983,296 parameters (3840 × 256 weights + 256 biases), whereas our RGB-D model’s layer contains only 327,936 parameters (1280 × 256 weights + 256 biases). This 66.7% reduction in parameters for the input-processing stage directly implies a lower computational load and faster inference, highlighting another key practical advantage of our approach over temporal stacking.
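The parameter counts above can be verified directly: a fully connected patch-embedding layer with bias has in_dim × out_dim + out_dim parameters. A minimal check, using only dimensions stated in the text:

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Parameter count of a fully connected (patch-embedding) layer."""
    return in_dim * out_dim + (out_dim if bias else 0)

EMBED_DIM = 256

params_4rgb = linear_params(3840, EMBED_DIM)  # 4RGB: 12-channel stacked input
params_rgbd = linear_params(1280, EMBED_DIM)  # RGB-D: 4-channel fused input

print(params_4rgb)  # 983296
print(params_rgbd)  # 327936
# Reduction is exactly 2/3 (66.7%) counting weights alone,
# and ~66.6% once the shared 256 bias terms are included.
```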
To systematically dissect the factors behind RGB-D’s superior performance and to clarify the distinct roles of appearance, geometry, and temporal redundancy, we conducted a comprehensive ablation study. This study evaluated five modality variants: the proposed RGB-D, the 4RGB baseline, Gray and Depth (G-D) to isolate geometry, Four Stacked Gray and Depth (4G-D) to test temporal compensation, and Four Stacked RGB and Depth (4RGB-D) to assess the impact of high-dimensional temporal stacking.
Figure 4 presents the training reward curves for all five modality variants, placing the direct comparison from
Figure 3 (our proposed RGB-D vs. the 4RGB baseline) into the broader context of the full ablation study. These curves are smoothed with a moving average (k = 20) to visualize learning stability and convergence over 500 episodes. The plot provides clear visual evidence of the RGB-D agent’s robust learning trajectory, which consistently achieves the highest rewards. In stark contrast, the erratic, low-reward curves of the G-D and 4RGB-D variants illustrate the instability caused by insufficient semantic cues and excessive input dimensionality, respectively. Even with the addition of temporal cues, the 4G-D variant fails to match the performance of the color-enabled modalities, highlighting that motion information cannot fully substitute for rich appearance features.
Figure 5 provides a visual summary of the final navigation performance by comparing the success rate (bars) and the average reward (line) across the five tested modality variants: RGB-D, 4RGB, Gray and Depth (G-D), Four Stacked Gray and Depth (4G-D), and Four Stacked RGB and Depth (4RGB-D). The figure confirms that RGB-D achieved the highest performance, clearly surpassing the 4RGB baseline. Geometry-only inputs (G-D) and stacked temporal inputs (4G-D and 4RGB-D) are shown to underperform, establishing evidence that appearance and geometry are complementary and that temporal stacking alone is inadequate.
The ablation experiments, with results presented in
Figure 4 and
Figure 5 and summarized in
Table 4, further clarified the roles of appearance, geometry, and temporal redundancy. The ablation experiments confirmed that appearance and geometry are complementary and essential. Geometry alone, as tested in the Gray and Depth (G-D) variant, proved insufficient: with color semantics such as wall–floor contrast and object boundaries removed, the policy struggled to navigate robustly, achieving only 14.0% success and an average reward of −1.25 with unstable convergence. We then investigated whether temporal redundancy could compensate. Stacking four G-D frames (4G-D) did improve stability, raising the success rate to 26.0% and the reward to +25.4, confirming that motion cues from temporal stacking provide useful, stabilizing information. Even with this improvement, however, 4G-D failed to approach the performance of the color-enabled variants, demonstrating that temporal cues cannot substitute for the rich semantic information provided by color. Finally, we investigated whether adding temporal data to our best modality would yield further gains. Although the four-stacked RGB-D (4RGB-D) variant combines appearance, geometry, and temporal cues, the resulting 16-channel input created a dimensional explosion: the patch embeddings grew to a size of 5120, which destabilized training. Beyond this dimensional challenge, other factors likely contributed to the collapse. Our lightweight Transformer (2 blocks, 4 heads) was held constant across all experiments for a fair comparison, and it likely lacked the parameter capacity to process such a high-dimensional input effectively. Moreover, the large input made the optimization problem significantly harder, and the fixed SAC hyperparameters (such as the learning rate) used across all experiments were likely ill-suited to this more complex task, leading to the observed training instability. Consequently, performance degraded markedly to just 16.0% success and a −6.7 reward.
This outcome demonstrates that brute-force temporal stacking is computationally inefficient and ultimately less effective than a principled, single-frame fusion of complementary modalities.
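The patch dimensionalities quoted across the ablation (1280 for RGB-D, 3840 for 4RGB, 5120 for 4RGB-D) all follow from the same patch geometry, which implies 320 pixels per patch per channel; a quick sanity check:

```python
# Implied by the reported token sizes: 3840 / 12 = 1280 / 4 = 5120 / 16 = 320.
PIXELS_PER_PATCH = 320

def patch_token_dim(channels: int) -> int:
    """Flattened patch dimension for a given number of input channels."""
    return channels * PIXELS_PER_PATCH

print(patch_token_dim(4))   # RGB-D  -> 1280
print(patch_token_dim(12))  # 4RGB   -> 3840
print(patch_token_dim(16))  # 4RGB-D -> 5120
```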
Next, we examined how the stage of sensor fusion impacts performance.
Figure 6 displays the training reward curves, calculated using a moving average (k = 20), for the three investigated fusion strategies: Early Fusion (Variant A), Parallel Encoding (Variant B), and Late Fusion (Variant C). The visualization highlights the critical role of fusion-stage design, demonstrating that Early Fusion (Variant A) consistently yields the highest and most stable rewards throughout the 500 training episodes. In sharp contrast, Parallel Encoding (Variant B) exhibits an early and catastrophic performance collapse. Late Fusion (Variant C) shows a smoother but significantly weaker trajectory, converging well below Variant A, though above the collapse point of Parallel Encoding.
Figure 7 presents a visual comparison of the final policy performance for the fusion strategies—Early Fusion (Variant A), Parallel Encoding (Variant B), and Late Fusion (Variant C)—by showing the success rate and average reward metrics. The figure clearly illustrates that Early Fusion (Variant A) significantly outperforms both the parallel and late fusion designs. This performance differential confirms that the pixel-level integration of RGB and depth provides the most effective cross-modal representation for the Transformer.
Table 5 quantifies the final performance metrics for the fusion strategies: Variant A (Early Fusion), Variant B (Parallel Fusion), and Variant C (Late Fusion). Early Fusion (Variant A) is numerically confirmed as superior, achieving 40.0% ± 4.2% success and a +94.1 ± 15.8 average reward. Parallel Fusion (Variant B) yielded the lowest performance, with 6.0% ± 1.5% success and a −10.4 ± 4.8 reward. Late Fusion (Variant C) achieved an intermediate but poor result of 16.0% ± 2.8% success and a −7.2 ± 3.1 reward. The table confirms the consistent underperformance of both parallel and late designs relative to the early fusion strategy.
Early fusion (Variant A), which concatenates RGB and depth at the pixel level before patch tokenization, proved the most effective strategy, achieving 40.0% success and a +94.1 reward. In contrast, parallel encoding (Variant B), which doubles the token sequence length to 129, severely burdened the lightweight Transformer; this architectural mismatch led to unstable learning and a collapse to just 6.0% success and a −10.4 reward. It is important to note that this collapse may reflect the fixed capacity of our encoder being insufficient for the longer sequence, rather than an inherent flaw in the parallel fusion concept under a different computational budget. Finally, late fusion (Variant C), which compresses depth into a scalar statistic per patch, trained more smoothly but more weakly: by discarding fine-grained geometric detail, it converged to only 16.0% success and a −7.2 reward. These results support a clear conclusion: for self-attention to be effective, cross-modal correlations between appearance and geometry must be available at the pixel or patch level. The learning dynamics of these strategies are shown in the training reward curves in
Figure 6, with a summary of their final performance metrics presented in
Figure 7 and
Table 5.
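To make the winning design concrete, early fusion (Variant A) amounts to a pixel-level channel concatenation followed by patch tokenization. The sketch below is illustrative only: it uses NumPy and a square 16×16 patch for clarity, whereas the paper’s actual patch geometry (320 pixels per patch per channel) differs.

```python
import numpy as np

def early_fusion_patches(rgb, depth, patch=16):
    """Concatenate RGB and depth at the pixel level, then flatten
    non-overlapping patches into tokens (early fusion, Variant A).

    rgb:   (H, W, 3) float array
    depth: (H, W, 1) float array
    Returns an array of shape (num_patches, patch * patch * 4).
    """
    x = np.concatenate([rgb, depth], axis=-1)  # (H, W, 4) fused image
    H, W, C = x.shape
    tokens = (
        x.reshape(H // patch, patch, W // patch, patch, C)
         .transpose(0, 2, 1, 3, 4)             # group pixels by patch
         .reshape(-1, patch * patch * C)       # one flat vector per patch
    )
    return tokens
```

Because depth sits alongside the color channels of every pixel before tokenization, each token (and hence each attention head) sees appearance and geometry jointly, which is the property the results above attribute to Variant A.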
The final stage of our systematic investigation explored performance optimization through hyperparameter tuning.
Figure 8 illustrates the training reward curves, calculated using a moving average (k = 20), comparing Huang’s 4RGB baseline, the standard RGB-D (early fusion) variant, and the tuned RGB-D configuration. The curves confirm that the optimization (specifically, halving the actor and critic learning rates) mitigated late-episode instability in the critic losses, yielding improved training stability and convergence. The tuned model clearly achieves the highest overall reward, further widening the performance margin of RGB-D over the 4RGB baseline.
Figure 9 provides a final performance comparison across Huang’s 4RGB baseline, the original RGB-D policy, and the tuned RGB-D configuration. The figure distinctly shows that the tuned configuration achieves the highest success rate and average reward. This result underscores the inherent superiority of the RGB-D modality over 4RGB and highlights the potential for achieving substantial further gains through modest hyperparameter tuning.
Table 6 provides the definitive numerical quantification of the performance gains achieved following hyperparameter optimization. The table confirms that the RGB-D (Tuned) configuration secured the highest results across all experiments, achieving a 54.0% ± 3.8% success rate and a +146.8 ± 20.5 average reward. This optimized result represents a significant gain over the original RGB-D performance (40.0% ± 4.2% success, +94.1 ± 15.8 reward) and establishes a substantial lead over the 4RGB baseline (28.0% ± 3.5% success, +35.2 ± 10.1 reward).
By halving the actor and critic learning rates from 0.001 to 0.0005, we mitigated the late-episode instability in critic losses and improved convergence. This tuned RGB-D configuration achieves 54.0% success and +146.8 reward, outperforming both the original RGB-D and 4RGB baseline by a substantial margin. These findings demonstrate that RGB-D surpasses 4RGB under aligned conditions and allows further performance gains with modest tuning.
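Concretely, the tuning amounts to a single configuration change; a hypothetical hyperparameter snippet (the dictionary names and keys are illustrative, only the learning-rate values come from the text):

```python
# Hypothetical SAC hyperparameter dictionaries: only the actor and critic
# learning rates differ between the original and tuned runs.
ORIGINAL = {"actor_lr": 1e-3, "critic_lr": 1e-3}
TUNED = {**ORIGINAL, "actor_lr": 5e-4, "critic_lr": 5e-4}
```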
Figure 8 illustrates the improved training stability of the tuned model, and
Figure 9 compares its final performance against the baselines. The significant gains are also quantified in
Table 6.
Together, these results establish RGB-D early fusion as a principled and superior successor to 4RGB. Ablation studies confirm that appearance and geometry are complementary, temporal redundancy is insufficient, and early fusion is the most effective integration strategy. Moreover, the tuned RGB-D configuration demonstrated that the framework can be further stabilized and improved, providing a strong foundation for future sim-to-real transfers.