4.1. Experimental Setup
To comprehensively evaluate the performance of Mamba-DQN, experiments were conducted in three distinct environments: Highway-fast-v0, LunarLander-v3, and Atari 2600 Pong. The Highway-fast-v0 environment, featuring frequent state transitions in autonomous driving simulations, served as the primary benchmark for structural and hyperparameter analyses. In contrast, the LunarLander-v3 environment, characterized by sparse rewards and high control complexity, was employed to assess the model’s generalization capability.
Smoothed total reward and TD loss were used as the primary evaluation metrics, where smoothing was applied for visualization purposes. For Highway-fast-v0, the performance was averaged over steps 200 to 20,000 to exclude the initial exploration phase, while for LunarLander-v3, hyperparameters derived from the Highway-fast-v0 environment were applied without modification, and the performance was averaged over steps 10,000 to 200,000.
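As a point of reference, the smoothing described above can be realized, for example, as an exponential moving average over the logged rewards; the paper does not specify the exact smoothing method, so the decay factor and window below are assumptions for illustration only.

```python
import numpy as np

def smooth(values, weight=0.9):
    """Exponential moving average used only for visualization.

    `weight` is an assumed decay factor; the exact smoothing
    parameters are not reported in the text.
    """
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return np.asarray(smoothed)

# Example: average smoothed reward over the reported evaluation window
# (steps 200 to 20,000 for Highway-fast-v0).
rewards = np.random.rand(20_000)            # placeholder logged rewards
window_mean = smooth(rewards)[200:20_000].mean()
```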
To further evaluate the generalization capability of Mamba-DQN in a visually rich and dynamic environment, experiments were conducted in the Atari 2600 Pong environment. Pong is characterized by dense rewards and fast-paced two-dimensional spatiotemporal dynamics. All models were trained for 1,500,000 steps, and the results were used to assess generalization performance with pixel-based visual inputs.
All models were trained five times independently under identical hyperparameter settings, and the average results were reported to ensure statistical reliability. Experiments were conducted on an Ubuntu 22.04 system equipped with an NVIDIA RTX 3050 Laptop GPU (MSI Sword 15 A12UC, MSI, Taipei, Taiwan), Mamba-SSM version 2.2.5, Causal-Conv1D version 1.5.2, and PyTorch version 1.12. The random seed was fixed to 42 to ensure reproducibility.
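A minimal sketch of the seeding procedure implied by this setup is shown below; the exact seeding calls used in the original experiments are not reported, so this follows standard PyTorch/Gymnasium practice and should be read as an assumption.

```python
import random

import numpy as np
import torch

SEED = 42  # fixed seed reported in the setup above

def set_seed(seed: int = SEED) -> None:
    """Fix the relevant RNGs for reproducibility (assumed standard practice)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()
# Gymnasium environments are typically seeded at reset, e.g.:
# env.reset(seed=SEED)
```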
4.2. Comparative Evaluation with Baseline Models
In this section, the policy learning performance of Mamba-DQN is quantitatively compared to conventional DQN-based models. For each model, gradient clipping thresholds of 0.1, 0.5, and 1.0 were tested, and five independent training runs were conducted under identical environmental conditions. The average smoothed reward and TD loss were computed for each setting, and the clipping coefficient that yielded the best performance was adopted as the optimal value for each model. Based on these optimal conditions, the policy learning efficiency and stability of Mamba-DQN were compared with those of the baseline models.
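For clarity, the gradient clipping sweep described above corresponds to bounding the gradient norm in each DQN update, as sketched below; only the clipping coefficients (0.1, 0.5, 1.0) are taken from the text, while the optimizer-agnostic update and the Huber loss are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99, clip_coef=0.5):
    """One DQN update with gradient-norm clipping (clip_coef in {0.1, 0.5, 1.0})."""
    states, actions, rewards, next_states, dones = batch

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.smooth_l1_loss(q_values, targets)   # TD loss (assumed Huber)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_norm=clip_coef)
    optimizer.step()
    return loss.item()
```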
Table 2 summarizes the performance results for each model in the Highway-fast-v0 environment. Mamba-DQN outperformed all baseline models in terms of both average smoothed reward and TD loss. Specifically, with a gradient clipping coefficient of 0.5, Mamba-DQN achieved an average reward of 20.9864, showing a 26.1% improvement over DQN, and a TD loss of 0.0207, indicating an 84.9% reduction relative to DQN.
These results suggest that the structural design of Mamba-DQN, which effectively captures temporal dependencies and selectively integrates important experiences into learning, contributes positively to both convergence speed and performance.
Figure 4 presents the smoothed reward convergence curves for each model. The results show that Mamba-DQN achieves faster reward improvement during the early training stages and maintains the highest average reward throughout the training process.
Figure 5 presents the TD loss trends for each model in the Highway-fast-v0 environment. Mamba-DQN maintained lower TD errors from the early training phase and consistently exhibited reduced TD loss values throughout training, reflecting improved temporal representation and policy updates enabled by full state sequence re-encoding.
To further evaluate the generalization capability and environment-agnostic performance of the proposed architecture, additional experiments were conducted in the LunarLander-v3 environment using the same unmodified hyperparameter settings. This environment is characterized by sparse rewards and complex physical dynamics, making it suitable for assessing temporal representation capability and learning stability.
Table 3 summarizes the performance of all models in the LunarLander-v3 environment under fixed hyperparameter settings. Mamba-DQN (Clip 1.0) achieved the highest maximum reward (236.56), indicating a significantly higher performance ceiling than the baseline models. However, this peak performance came at the cost of stability; a single outlier run with a large negative reward heavily skewed the mean, resulting in a performance distribution with high variance (Std: 109.78).
Figure 6 presents the smoothed reward convergence curves for each model in the LunarLander-v3 environment. The results indicate that Mamba-DQN achieves faster reward improvement during the early training phase and reaches convergence after approximately 175,000 steps.
Figure 7 shows the TD loss trends for each model. The results show that Mamba-DQN maintains lower TD loss after convergence.
Previous studies have reported that in the LunarLander-v3 environment, DQN-based models typically require at least 500,000 training steps to achieve convergence, especially when the initial policy performance is low [26,27].
In comparison, Mamba-DQN achieved policy convergence and competitive reward levels within approximately 200,000 training steps in the same environment. These results indicate that the proposed time-series encoding architecture improves state representation and enables effective priority-based sample selection during the early stages of training, thereby facilitating more efficient policy learning with reduced sample requirements.
To further assess the generalization capability of Mamba-DQN in a high-dimensional visual domain, additional experiments were conducted in the Atari 2600 Pong environment. Atari 2600 Pong is characterized by dense rewards and fast-paced two-dimensional spatiotemporal dynamics, requiring both effective temporal modeling and visual feature extraction. For all models, a CNN-based preprocessing module was integrated to process raw pixel observations, while all other hyperparameter settings remained identical to those used in the previous experiments.
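A minimal sketch of such a CNN preprocessing module is given below, following the widely used Nature-DQN convolutional stack; the paper does not report the exact architecture, so the layer sizes and feature dimension are illustrative assumptions. In the Mamba-DQN setting, per-frame features of this kind would presumably be stacked along the time dimension before being passed to the sequence encoder.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """CNN front-end that maps stacked Pong frames to a feature vector.

    Layer sizes follow the common Nature-DQN stack; the paper's exact
    configuration is not reported, so these values are illustrative.
    """
    def __init__(self, in_channels: int = 4, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 inputs yield 7x7x64 = 3136 features after the conv stack.
        self.proj = nn.Linear(3136, feature_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, 84, 84), pixel values scaled to [0, 1]
        return self.proj(self.conv(frames))
```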
For the pixel-based Atari 2600 Pong environment, the gradient clipping coefficient was fixed at 1.0. This decision was guided by results from the other two environments, where a coefficient of 1.0 consistently yielded stable training and competitive performance. As the Pong experiment was intended to evaluate the proposed architecture’s generalization to high-dimensional visual inputs, we selected a representative and well-performing setting rather than conducting exhaustive hyperparameter tuning.
Although this setting was not individually optimized for each model, the results demonstrate that Mamba-DQN retains its performance advantages in a visually rich and dynamic environment. Each model was trained for 1,500,000 steps, and evaluation was based on smoothed average reward and TD loss. These results are presented in Table 4.
Figure 8 presents the smoothed reward convergence curves for each model in the Atari 2600 Pong environment. The results indicate that Mamba-DQN achieves faster reward improvement during the early training phase and reaches convergence after approximately 900,000 steps.
Figure 9 shows the TD loss trends for each model in the Atari 2600 Pong environment. The results demonstrate that Mamba-DQN maintains lower TD loss after convergence.
Prior studies have reported that in the Atari 2600 Pong environment, DQN-based models typically require up to 50 million training steps to achieve convergence [7]. In contrast, Mamba-DQN achieved faster policy convergence and competitive reward levels within approximately 1.5 million training steps in the same environment. These results indicate that the proposed architecture improves state representation and effectively facilitates policy learning.
The experiments across these environments were designed to evaluate the generalization capability and environment-agnostic applicability of the proposed architecture. The environments differ in reward density, state dimensionality, and control complexity, and were employed to assess whether the proposed architecture and learning method achieve policy convergence under such varied conditions.
In summary, the proposed Mamba-DQN demonstrated consistent performance across environments with different reward densities and state transition dynamics. These results confirm that the proposed time-series encoding-based learning strategy facilitates faster policy convergence and enhances the generalization capability of reinforcement learning policies.
4.3. Ablation Study on the Structural Components of Mamba-DQN
The ablation experiments were conducted with different component removal settings for each environment.
For the 1D signal-based environments, namely LunarLander-v3 and Highway-fast-v0, two ablated variants were evaluated: (i) a model without PER [28] and (ii) a model without the Mamba-SSM encoder. The performance of these variants was compared against the full Mamba-DQN containing all components.
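As context for the PER ablation, a minimal sketch of proportional prioritized sampling is shown below; it follows the standard PER formulation (priorities proportional to absolute TD error) rather than the paper's exact implementation, whose details are not given here.

```python
import numpy as np

class ProportionalPER:
    """Minimal proportional PER buffer (illustrative, not the paper's code)."""
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p          # new samples get max priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[: len(self.data)] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)
        return idx, [self.data[i] for i in idx], weights / weights.max()

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps
```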
For the 2D image-based Atari 2600 Pong environment, two different ablations were performed: (i) a model without the Mamba-SSM encoder and (ii) a model in which the replay buffer was modified to store only the last state of each sequence while still providing sequence inputs to the network. These variants were compared with the full Mamba-DQN to assess the impact of each component.
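The difference between the two replay-storage settings in the Pong ablation can be illustrated as follows; the buffer interface is a hypothetical simplification, as the paper does not list its implementation.

```python
from collections import deque
import random

class SequenceReplayBuffer:
    """Stores either the full state sequence or only its last state.

    `store_full_sequence=True` corresponds to the full Mamba-DQN setting;
    `False` corresponds to the ablated variant described above.
    """
    def __init__(self, capacity: int, store_full_sequence: bool = True):
        self.buffer = deque(maxlen=capacity)
        self.store_full_sequence = store_full_sequence

    def add(self, state_seq, action, reward, next_state_seq, done):
        if not self.store_full_sequence:
            # Ablation: keep only the most recent state of each sequence,
            # discarding the temporal context the encoder would re-encode.
            state_seq, next_state_seq = state_seq[-1:], next_state_seq[-1:]
        self.buffer.append((state_seq, action, reward, next_state_seq, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)
```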
Table 5 shows that removing the Mamba-SSM encoder leads to complete failure in learning, with rewards remaining near the environment’s minimum score throughout training. Modifying the replay buffer to store only the last state of each sequence results in a statistically significant performance drop with a large negative effect size (Cohen’s d; see Table 5), confirming the importance of full sequence context in image-based environments.
As shown in Table 6, removing PER causes a substantial degradation in performance, with the average reward dropping markedly relative to the full model. Removing the Mamba-SSM encoder also significantly reduces performance, highlighting that both prioritized sampling and the encoder contribute to 1D signal-based tasks.
Table 7 shows that removing PER results in only a minor, statistically insignificant change, indicating that PER has less influence in this environment. However, removing the Mamba-SSM encoder produces a significant decrease in performance, confirming the encoder’s role in achieving higher rewards in high-speed navigation tasks.
Across all environments, the Mamba-SSM encoder emerges as the most critical component, with its removal consistently resulting in statistically significant performance degradation and large negative effect sizes.
While the PER contributes substantially in environments with sparse or high-variance rewards such as LunarLander-v3, its impact is less pronounced in highway-driving scenarios.
In the image-based Atari 2600 Pong task, both the encoder and the use of a full sequence replay buffer are essential for successful learning, with their removal leading to severe performance drops or complete training failure.
These findings validate that both architectural design and sampling strategy are environment-dependent and that the proposed full Mamba-DQN configuration offers the most robust performance across diverse task types.
4.4. Summary and Implications
Based on the comprehensive experimental evaluation, including both the comparative and ablation studies, the results demonstrate that Mamba-DQN achieves strong performance, often outperforming the baseline models across diverse environments. Although the gains were not universally statistically significant, Mamba-DQN exhibited clear advantages, with most configurations showing statistically significant improvements in both smoothed average reward and TD loss.
These performance gains stem from the architectural characteristics of Mamba-DQN. The Mamba-based time-series encoder efficiently compresses temporal dependencies, even with short input sequences, while the use of full state sequence re-encoding at each training step ensures temporal representations remain consistent with the current network parameters. This design promotes coherent Q-value estimation and mitigates representational drift during training.
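A minimal sketch of this design, assuming the mamba_ssm package’s Mamba block and a simple linear Q-head, is shown below; the embedding dimension, state size, and single-block depth are illustrative rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # Mamba-SSM 2.x; operates on (batch, length, dim)

class MambaQNetwork(nn.Module):
    """Q-network that re-encodes the full state sequence at every update.

    `d_model`, `d_state`, and the single-block depth are assumptions;
    the paper's exact configuration may differ.
    """
    def __init__(self, state_dim: int, n_actions: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        self.encoder = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.q_head = nn.Linear(d_model, n_actions)

    def forward(self, state_seq: torch.Tensor) -> torch.Tensor:
        # state_seq: (batch, seq_len, state_dim) -- the stored raw sequence.
        # Because raw states (not cached features) are stored, the sequence is
        # re-encoded with the current parameters at every training step.
        h = self.encoder(self.embed(state_seq))     # (batch, seq_len, d_model)
        return self.q_head(h[:, -1])                # Q-values from the final step
```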
Consequently, Mamba-DQN shows strong applicability to dynamic environments with frequent state transitions, where temporal alignment and accurate value estimation are critical for policy convergence. In summary, the proposed architecture integrates temporal abstraction, structural consistency, and training efficiency, offering strong potential for effective and transferable reinforcement learning performance.