1. Introduction
Infrared small target detection (IRSTD) is a cornerstone technology for ensuring safety and maintaining strategic advantage in high-stakes domains such as national defense, aviation safety, and autonomous systems [1,2,3,4]. The practical urgency of this field is underscored by its critical applications: detecting stealth aircraft or incoming missile threats in military early-warning systems, identifying unmanned aerial vehicles (UAVs) in restricted airspace, and spotting potential obstacles for safe navigation in maritime surveillance and autonomous driving, especially under low-visibility conditions. In these scenarios, targets are often small, faint, and distant, appearing as mere pixel clusters with no discernible shape or texture. They are easily submerged in complex backgrounds containing heavy cloud clutter, sea-sky lines, or urban thermal noise. The failure to reliably detect such targets can have catastrophic consequences, making the development of highly sensitive and robust detection algorithms not just an academic pursuit, but a practical imperative.
As illustrated in Figure 1, infrared small targets in real-world scenarios typically occupy only a few pixels, are embedded in heterogeneous, nonstationary backgrounds, and exhibit low signal-to-clutter ratios (SCRs). The lack of color and texture cues, frequent defocus or motion blur, and pronounced scale imbalance further complicate the problem, causing weak target signals to be easily submerged by structured clutter.
Before the widespread adoption of deep learning, single-frame IRSTD relied largely on model-driven techniques, including spatial-frequency filtering, human visual system (HVS)-inspired contrast mechanisms, and low-rank/sparse decomposition [5,6,7,8]. While these approaches offer interpretability and modest computational cost, their effectiveness hinges on handcrafted priors and narrow operating assumptions. As reported across remote sensing and surveillance studies, robustness degrades when scene statistics drift, clutter changes, or targets deviate from the assumed models, limiting practical deployment [9].
With the emergence of public datasets, IRSTD has increasingly been formulated as a fine-grained segmentation task, where U-shaped encoder–decoder networks with skip connections preserve spatial detail while learning semantic abstractions [10]. Numerous variants enhance cross-layer interaction and feature utilization through asymmetric top-down/bottom-up modulation, dense nested aggregation, and attention-guided fusion [11,12]. Nonetheless, persistent limitations are observed in single-frame pipelines. First, repeated downsampling suppresses weak small-target activations and disrupts hierarchical information flow, impeding the consolidation of fragmentary evidence into coherent semantics [12]. Second, the semantic gap between encoder outputs and decoder inputs is insufficiently bridged by naïve skips or rudimentary fusion, allowing clutter to propagate and diminishing boundary precision for tiny targets [9]. Third, long-range contextual perception remains unreliable in deeper layers: background continuity is under-modeled, target–background similarity is high, and spurious responses elevate false alarms [13].
To alleviate the limited contextual field of CNNs, Transformer-based designs were introduced, but their quadratic computational complexity (O(N²)) creates a significant accuracy-efficiency dilemma, especially for the high-resolution imagery typical of IRSTD. This calls for a new architectural paradigm. The recently proposed Mamba architecture, a State Space Model (SSM) with linear complexity (O(N)), presents a compelling alternative. However, to merely view Mamba as a more efficient substitute would be to overlook its fundamental functional novelty. The Transformer's self-attention performs a static, all-to-all comparison, whereas Mamba operates via a dynamic, sequential scan with content-aware gating. This represents a paradigm shift from static global comparison to dynamic, selective global modeling, a more robust capability for distinguishing faint targets from deceptive backgrounds.
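For reference, this mechanism can be made concrete with the state-space recurrence that Mamba discretizes (shown here in a simplified, Euler-style form consistent with [20]; Δt, Bt, and Ct are generated from the current token xt, which is what makes the scan selective):

```latex
% Per-token recurrence of a selective state-space model (sketch).
% Each step consumes one token and the previous hidden state only,
% so a length-N sequence costs O(N), unlike O(N^2) self-attention.
\begin{aligned}
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,\\
\bar{A}_t &= \exp(\Delta_t A), \qquad \bar{B}_t \approx \Delta_t B_t .
\end{aligned}
```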
However, simply adopting this new paradigm is insufficient. The fundamental challenges of IRSTD demand a systematic architectural response. Specifically, three critical bottlenecks must be addressed: (1) Early-stage signal preservation: weak target signatures are easily lost to signal drowning when standard convolutions uniformly process all pixels. (2) Global scene disambiguation: at the network’s deepest point, limited receptive fields fail to resolve “deceptive clutter” that is locally indistinguishable from targets. (3) Semantic-aware fusion: during upsampling, the semantic gap between high-level features and spatial details allows clutter to propagate through naïve skip connections, diminishing boundary precision.
Guided by these principles, we develop MixMambaNet, a Mamba-enhanced IRSTD framework that systematically addresses each bottleneck through three synergistic modules. First, the Perception-aware Hybrid Encoder (PHE) replaces conventional residual blocks to combat signal drowning by decoupling local perceptual attention from mixed pixel-channel attention, thereby strengthening minute structures while retaining image-wide statistics. Second, the MixMamba Bottleneck (MMB) leverages selective-scan 2D state-space modeling to provide efficient long-range reasoning with linear complexity, resolving global ambiguities that CNNs cannot capture and Transformers address only at prohibitive cost. Third, the Non-local Mamba Aggregation (NMA) module substitutes standard skip fusion to bridge the semantic gap through adaptive, context-aware feature fusion that aligns cross-scale semantics and filters structured clutter. The resulting U-shaped network employs deep supervision for stable optimization and improved discriminability across decoder stages. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1k show consistent gains over prevailing CNN, Transformer, and hybrid approaches, including SCTransNet, with competitive efficiency [9,11,14,15].
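To make the data flow concrete, the following PyTorch sketch mirrors the U-shaped pipeline described above. The PHE, MMB, and NMA internals are deliberately replaced with nn.Identity placeholders: this illustrates only the routing (encode with skips, global bottleneck, context-aware fusion with deep supervision), not the authors' module implementations.

```python
import torch
import torch.nn as nn

class MixMambaNetSketch(nn.Module):
    """Routing sketch only: nn.Identity stands in for PHE/MMB/NMA."""

    def __init__(self, stages: int = 4, ch: int = 32):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 3, padding=1)
        self.phe = nn.ModuleList(nn.Identity() for _ in range(stages))  # encoder (PHE)
        self.down = nn.ModuleList(nn.MaxPool2d(2) for _ in range(stages))
        self.mmb = nn.Identity()                                        # bottleneck (MMB)
        self.nma = nn.ModuleList(nn.Identity() for _ in range(stages))  # skip fusion (NMA)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.heads = nn.ModuleList(nn.Conv2d(ch, 1, 1) for _ in range(stages))

    def forward(self, x):
        x, skips = self.stem(x), []
        for phe, down in zip(self.phe, self.down):   # encode and keep skips
            x = phe(x)
            skips.append(x)
            x = down(x)
        x = self.mmb(x)                              # global disambiguation
        outs = []
        for nma, head, skip in zip(self.nma, self.heads, reversed(skips)):
            x = nma(self.up(x) + skip)               # fuse context with detail
            outs.append(head(x))                     # deep-supervision output
        return outs                                  # one map per decoder stage

maps = MixMambaNetSketch()(torch.randn(1, 1, 256, 256))  # 4 multi-scale maps
```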
The contributions of this work are summarized as follows:
Conventional residual blocks are replaced with a perception-aware hybrid encoder, where local perceptual attention is coupled with mixed pixel–channel attention in multi-branch paths so that fine, low-contrast structures are amplified while global context is injected early to suppress structured distractors.
Dense pre-enhancement is integrated with a selective-scan 2D state-space core (Mamba) and a lightweight hybrid-attention tail to realize linear-complexity long-range reasoning that is better matched to the weak-signal characteristics of IRSTD than quadratic self-attention.
Standard skip fusion is replaced with a non-local Mamba aggregation module, in which DASI-style multi-scale integration, SS2D-driven selective scanning, and adaptive non-local enhancement are combined to align cross-scale semantics and to model background continuity, thereby reducing spurious responses in cluttered regions.
2. Related Work
The detection of infrared small targets (IRSTD) has evolved significantly, transitioning from traditional model-driven methods to advanced data-driven deep learning paradigms. This section reviews this evolution, critically analyzing the limitations of existing approaches and contextualizing the architectural innovations of our proposed MixMambaNet.
2.1. Traditional Model-Driven Methods
Early IRSTD research was dominated by model-driven approaches that leveraged hand-crafted features derived from the assumed physical properties of targets. These methods can be broadly classified into filtering-based and Human Visual System (HVS) inspired techniques. Filtering-based methods, such as the classic Top-Hat transform [16], operate by subtracting a morphologically opened (background) image from the original, thereby isolating bright regions. While straightforward, their efficacy is critically dependent on the size and shape of the structuring element, which struggles to adapt to variations in target scale and background complexity, often leading to significant background residuals.
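As a concrete illustration, a minimal top-hat filter fits in a few lines (a sketch using SciPy's grey-scale morphology; the structuring-element size is an illustrative choice, not a value from the cited work):

```python
import numpy as np
from scipy import ndimage

def top_hat(img: np.ndarray, se_size: int = 5) -> np.ndarray:
    """White top-hat: original minus its morphological opening, which
    keeps bright structures smaller than the structuring element."""
    return ndimage.white_tophat(img, size=(se_size, se_size))
```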
Inspired by how humans perceive salient objects, HVS-based methods like the Local Contrast Method (LCM) [7] and its numerous variants were developed. These methods compute a saliency map based on the contrast between a central patch and its surrounding pixels. While computationally efficient and intuitive, their reliance on local statistics makes them highly susceptible to false alarms triggered by strong edges, corner points, or even pixel-level noise that mimics high local contrast. The fundamental deficiency uniting this entire class of traditional methods is their reliance on fixed, low-level priors. They lack the capacity to learn from data, rendering them brittle and unable to generalize across the diverse and dynamic scenes encountered in real-world applications.
2.2. CNN-Based Methods
The advent of deep learning, particularly Convolutional Neural Networks (CNNs), marked a paradigm shift in IRSTD. By learning hierarchical feature representations directly from data, CNN-based methods have consistently outperformed their traditional counterparts. Seminal works like ACM [11] and ALCNet [17] established the effectiveness of the U-Net-style encoder–decoder architecture, often enhanced with attention mechanisms or contextual modules to refine feature maps. Subsequent developments, including DNA-Net [12] and UIU-Net [14], introduced more intricate feature fusion strategies and interactive connections to bridge the semantic gap between the encoder and decoder paths.
Despite these advancements, a fundamental and inherent limitation persists across all CNN-based architectures: the locality of the convolution operation. As information propagates through deeper layers, the effective receptive field, while growing, remains fundamentally constrained. This makes it challenging for the network to model the long-range spatial dependencies necessary to perform global scene disambiguation. For instance, a CNN may struggle to differentiate a true, isolated target from a small, bright patch that is part of a larger, distant clutter structure (e.g., a building’s window frame). This locality constraint is the primary cause of “deceptive clutter” false alarms and is a core challenge that our work aims to address.
2.3. Advanced Methods
To overcome the inherent locality of CNNs, a significant body of recent research has focused on designing architectures capable of capturing global context. This pursuit has primarily branched into two main avenues: integrating Transformer-based modules and exploring other non-local modeling paradigms.
Transformer-based Hybrids: This is currently the most prominent approach. Methods like SCTransNet [9] incorporate blocks from hierarchical Transformers (e.g., Swin Transformer) into the U-Net backbone. The self-attention mechanism within these blocks allows the model to compute relationships between all feature tokens, thereby achieving a global receptive field. Similarly, TransUNet [18] demonstrates that combining a Transformer-based encoder for global context with a CNN-based decoder for precise localization can yield strong performance in segmentation tasks. While this hybrid strategy has proven effective for reducing false alarms caused by structured, non-local clutter, it comes at a steep price: the quadratic computational complexity (O(N²)) of the self-attention mechanism with respect to the number of image tokens. This makes pure or heavy Transformer models inefficient, particularly for high-resolution infrared images, hindering their practical deployment in resource-constrained or real-time scenarios.
Other Non-Local Modeling Paradigms: Beyond Transformers, researchers have explored other ways to model long-range dependencies. For instance, Non-local Neural Networks [19] introduce a generic "non-local" block that computes the response at a position as a weighted sum of features at all positions. While this successfully captures global dependencies, much like a single self-attention layer, it also suffers from high computational cost and can be challenging to integrate efficiently. More recently, new paradigms continue to emerge. For example, MLEDNet [1] proposes a dense nested network that assists detection with multi-directional learnable edge information. While innovative, its primary focus on local edge priors may not fully resolve the challenge of suppressing large-scale, non-edge-based clutter, which requires a more holistic scene understanding.
A common thread connects all these advanced methods: a persistent and challenging trade-off between the ability to model global context and the demand for computational efficiency. CNNs are efficient but local; Transformers and non-local blocks are global but computationally expensive. This creates a critical and unmet need for an architecture that can achieve comprehensive global context modeling with high computational efficiency.
2.4. The Emergence of Mamba-Based Methods
The recent introduction of State Space Models (SSMs), particularly the Mamba architecture [20], has presented a highly promising solution to the efficiency-performance trade-off. With their ability to model long-range dependencies in linear time, Mamba-based models are rapidly emerging as a compelling alternative to Transformers in the IRSTD field.
A notable contemporary work in this direction is MOU-Mamba [21], which proposes a Multi-Order U-shape Mamba architecture. Its core innovation lies in redesigning the Mamba block itself to be more adept at visual tasks, introducing a Multi-Order 2D-Selective-Scan (MO-SS2D) module to capture dependencies at various scales. This approach focuses on enhancing the internal capabilities of the Mamba block, making it a more powerful, self-contained unit for feature extraction.
This design philosophy, however, presents a clear architectural trade-off. By tasking a single, complex block with handling both local and global feature processing, it potentially overlooks the distinct advantages and proven efficiency of specialized modules. The attempt to create a versatile, “all-in-one” Mamba block leads to a less direct architectural strategy, where the inherent strengths of different computational paradigms, such as the local feature expertise of CNNs, are not explicitly leveraged. This points to an alternative and potentially more efficient architectural direction: a synergistic framework where different modules collaborate based on their specialized strengths.
Our proposed MixMambaNet directly addresses this architectural question by opting for a synergistic, hybrid approach. Instead of modifying the core Mamba block, we leverage its primary strength, efficient global modeling, and integrate it within a carefully designed framework where it collaborates with CNNs. This clear division of labor is our core advantage. CNNs handle local features: we retain convolutions for what they do best, efficient and robust extraction of local spatial features. Mamba handles global context: we strategically deploy Mamba modules at critical points in the network (PHE, MMB, NMA) to establish long-range dependencies and perform global scene disambiguation. This architectural philosophy of synergistic delegation, rather than monolithic block enhancement, allows MixMambaNet to achieve a superior balance of performance, parameter efficiency (10.36 M vs. MOU-Mamba's 13.44 M), and architectural clarity, establishing a new state-of-the-art.
4. Experiments
4.1. Evaluation Metrics
To comprehensively evaluate the performance of our proposed model and conduct a fair comparison with state-of-the-art methods, we employ a set of widely recognized metrics. These metrics are categorized into two groups: those assessing detection and segmentation accuracy, and those evaluating model complexity and computational efficiency.
4.1.1. Accuracy and Detection Performance Metrics
Intersection over Union (IoU): As a primary metric for segmentation quality, IoU quantifies the spatial overlap between the predicted segmentation map (P) and the ground-truth mask (G). It is calculated as the ratio of the area of their intersection to the area of their union. A higher IoU score signifies a more precise alignment of the predicted target shape with the actual target. It is formally defined as follows:
IoU = TP / (TP + FP + FN),
where TP, FP, and FN represent the counts of true-positive, false-positive, and false-negative pixels, respectively.
Normalized Intersection over Union (nIoU): To mitigate the potential bias in the standard IoU metric [11], where datasets containing targets of varying sizes might disproportionately influence the average score, we also adopt the nIoU. This metric calculates the IoU for each target individually and then averages these scores across all targets present in the dataset. This ensures that each target, regardless of its size, contributes equally to the final performance score.
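A minimal sketch of both metrics on binary masks follows; connected components stand in for targets, and the per-target matching rule (compare each ground-truth target against the union of predicted blobs touching it) is one plausible reading of the protocol in [11], not a verbatim copy of any released evaluation code:

```python
import numpy as np
from scipy import ndimage

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level IoU = TP / (TP + FP + FN) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def niou(pred: np.ndarray, gt: np.ndarray) -> float:
    """nIoU: IoU computed per ground-truth target, then averaged, so
    every target contributes equally regardless of its size."""
    pred = pred.astype(bool)
    gt_lab, n_gt = ndimage.label(gt.astype(bool))
    pr_lab, _ = ndimage.label(pred)
    scores = []
    for k in range(1, n_gt + 1):
        target = gt_lab == k
        hits = np.unique(pr_lab[target])           # predicted blobs touching it
        matched = np.isin(pr_lab, hits[hits > 0])  # union of those blobs
        scores.append(iou(matched, target))
    return float(np.mean(scores)) if scores else 1.0
```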
F-measure: The F-measure provides a balanced assessment of a model's performance by computing the harmonic mean of Precision and Recall. Precision measures the accuracy of the positive predictions (i.e., the proportion of correctly identified target pixels among all pixels predicted as targets), while Recall measures the model's ability to identify all actual target pixels. The F-measure is particularly useful in scenarios with significant class imbalance, a common characteristic of small target detection. The formulas are as follows:
Precision = TP / (TP + FP), Recall = TP / (TP + FN),
F-measure = 2 × Precision × Recall / (Precision + Recall).
Probability of Detection (Pd): This metric evaluates the model's efficacy at the target level rather than the pixel level. It represents the ratio of the number of correctly detected targets (N_correct) to the total number of actual targets (N_all) in the dataset:
Pd = N_correct / N_all.
A target is considered correctly detected if the centroid deviation between its predicted segmentation and the ground-truth mask is within a predefined pixel threshold, set to 3 pixels following comparable work [9,12].
False-Alarm Rate (Fa): The False-Alarm Rate is a critical indicator of a model's robustness against background clutter. It measures the proportion of background pixels that are incorrectly classified as target pixels (P_false) relative to the total number of pixels (P_all) in the entire image:
Fa = P_false / P_all.
A lower Fa value indicates a stronger capability to suppress background noise and reduce spurious detections.
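For completeness, a sketch of both target-level Pd and pixel-level Fa for a single image pair is shown below, directly following the definitions above; the centroid-matching loop uses the 3-pixel rule, and published evaluation scripts may differ in edge-case handling:

```python
import numpy as np
from scipy import ndimage

def pd_fa(pred: np.ndarray, gt: np.ndarray, dist: float = 3.0):
    """Pd: fraction of GT targets whose centroid lies within `dist`
    pixels of some predicted blob's centroid. Fa: false-positive
    pixels divided by the total pixel count."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    gt_lab, n_gt = ndimage.label(gt)
    pr_lab, n_pr = ndimage.label(pred)
    gt_c = ndimage.center_of_mass(gt, gt_lab, range(1, n_gt + 1))
    pr_c = ndimage.center_of_mass(pred, pr_lab, range(1, n_pr + 1))
    detected = 0
    for gy, gx in gt_c:                      # match by centroid deviation
        if any((gy - py) ** 2 + (gx - px) ** 2 <= dist ** 2
               for py, px in pr_c):
            detected += 1
    pd = detected / n_gt if n_gt else 1.0
    fa = np.logical_and(pred, ~gt).sum() / pred.size  # FP pixels / all pixels
    return pd, float(fa)
```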
4.1.2. Model Complexity and Efficiency Metrics
Parameters (Params): This metric refers to the total number of learnable parameters (i.e., weights and biases) within the network. It serves as an indicator of the model’s size and static memory requirements. The number of parameters is typically reported in millions (M).
Floating Point Operations (FLOPs): FLOPs quantify the computational complexity of a model. This metric represents the total count of floating-point arithmetic operations required to process a single input image in a forward pass. It is a hardware-independent measure of the model’s theoretical inference speed, commonly expressed in GigaFLOPs (G).
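Both quantities are straightforward to obtain in PyTorch. Parameter counting needs no external tooling; FLOPs are typically read off a profiler (the thop usage in the comment is one common option, mentioned here as an assumption rather than as the authors' tooling):

```python
import torch
import torch.nn as nn

def params_millions(model: nn.Module) -> float:
    """Learnable parameters (weights and biases), reported in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# FLOPs via a profiler, e.g. the `thop` package:
#   from thop import profile
#   macs, _ = profile(model, inputs=(torch.randn(1, 1, 256, 256),))
# thop counts multiply-accumulates; conventions differ on whether
# reported GFLOPs means MACs or 2 * MACs, so the choice should be stated.
print(params_millions(nn.Conv2d(1, 32, 3)))  # tiny example: 0.00032 M
```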
4.2. Experiment Settings
Datasets. Our empirical evaluation is conducted on three widely used public benchmarks for infrared small target detection: NUAA-SIRST [11], NUDT-SIRST [12], and IRSTD-1k [26], which contain 427, 1327, and 1000 images, respectively. To ensure a fair and reproducible comparison, we adhere to the standard data partitioning protocols established in their original publications. Specifically, the training and test splits for NUAA-SIRST and NUDT-SIRST follow the methodology proposed by Dai et al., while the splits for IRSTD-1k are based on the work of Wang et al.
Implementation Details. All experiments were conducted within the PyTorch 2.1 deep learning framework. The model was trained and evaluated on a workstation equipped with a single NVIDIA GeForce RTX 3090 GPU, an Intel Core i7-12700KF CPU, and 32 GB of RAM. The model was trained from scratch, without reliance on any pre-trained weights. For data preparation, all images across the datasets were first normalized to a pixel value range of [0, 1]. Subsequently, to ensure a uniform input size for the network, random cropping was applied to generate patches of 256 × 256 pixels. To enhance model robustness and mitigate the risk of overfitting, we employed a standard online data augmentation strategy comprising random horizontal flipping (with a 50% probability) and random rotations. For network optimization, we utilized the Adam optimizer with an initial learning rate of 1 × 10⁻³. To facilitate stable convergence and fine-tuning in later stages of training, a Cosine Annealing scheduler progressively decayed the learning rate to a minimum of 1 × 10⁻⁵ over the course of training. The model was trained with a batch size of 16 for a fixed 1000 epochs. After each epoch, the model's performance was evaluated on a validation set, and the weights achieving the best validation performance were saved; this best-performing model was then used for the final evaluation on the test set. The entire training procedure for our model required approximately 23 h to complete.
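The optimization recipe above can be reproduced almost verbatim in PyTorch; in this sketch, a one-layer convolution and random tensors stand in for MixMambaNet and the SIRST training data, and validation-based checkpointing is left as a comment:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real model and data (illustrative only).
model = nn.Conv2d(1, 1, 3, padding=1)
criterion = nn.BCEWithLogitsLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 1, 256, 256),                # 256x256 crops
                  torch.randint(0, 2, (64, 1, 256, 256)).float()),
    batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-5)

for epoch in range(1000):                                      # fixed 1000 epochs
    for images, masks in train_loader:
        loss = criterion(model(images), masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                            # lr: 1e-3 -> 1e-5
    # evaluate on the validation set here; keep the best checkpoint
```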
Baselines. To rigorously assess the efficacy of our method, MixMambaNet is benchmarked against a comprehensive suite of state-of-the-art (SOTA) infrared small target detection algorithms. The comparison includes nine prominent learning-based methods: ACM [11], ALCNet [17], RDIAN [27], DNANet [12], ISTDU-Net [28], UIU-Net [14], MTU-Net [15], MOU-Mamba [21], and SCTransNet [9]. Results from our reproduction of MOU-Mamba are designated MOU-Mamba (re).
4.3. Quantitative Results
In this section, we present a comprehensive quantitative evaluation of MixMambaNet against a range of state-of-the-art (SOTA) methods. The analysis is conducted across three public datasets to demonstrate the effectiveness, robustness, and efficiency of our proposed architecture.
4.3.1. Performance on Individual Datasets
Table 1 and Table 2 provide a detailed comparison of performance metrics on the NUAA-SIRST, NUDT-SIRST, and IRSTD-1k datasets, with a visual summary in Figure 6. Our proposed MixMambaNet demonstrates consistently superior or highly competitive performance across all benchmarks.
On the NUAA-SIRST dataset, MixMambaNet achieves the highest scores in mIoU (77.66%), nIoU (82.00%), and Probability of Detection (Pd) at 97.24%. This indicates a remarkable capability in both accurately segmenting the target shape and successfully identifying its presence. While its F-measure (87.17%) is marginally second to SCTransNet (87.32%), its superior IoU scores highlight a more precise pixel-level prediction.
On the NUDT-SIRST dataset, which is larger and contains more complex scenarios, MixMambaNet establishes a new state-of-the-art by a significant margin. It surpasses all other methods across all four key metrics, achieving an mIoU of 95.33%, nIoU of 95.68%, F-measure of 97.17%, and a Pd of 98.77%. This dominant performance underscores the model’s exceptional robustness and its ability to handle diverse target and background variations.
On the challenging IRSTD-1k dataset, our model continues to exhibit strong generalization capabilities. It obtains the best mIoU (69.18%) and nIoU (69.32%), demonstrating superior segmentation accuracy. Furthermore, it achieves a highly competitive F-measure of 80.63% and maintains a low False-Alarm Rate (Fa) of 11.23, second only to SCTransNet. This result validates the effectiveness of MixMambaNet in complex and cluttered real-world environments.
Across all datasets, traditional methods like Top-Hat and WSLCM consistently yield low IoU and F-measure scores, confirming the significant advantage of deep learning-based approaches. Among these, MixMambaNet consistently positions itself as a top-performing model, validating the efficacy of its architectural design.
4.3.2. Comprehensive and Efficiency Analysis
To provide a holistic view of performance and model complexity, Table 3 presents the average metrics across all datasets, alongside the number of parameters and computational cost (FLOPs). MixMambaNet not only achieves the highest overall accuracy but also demonstrates an excellent balance between performance and efficiency. It records the best average IoU (84.18%), nIoU (87.11%), and F-measure (91.04%).
Crucially, it achieves this superior performance with a more streamlined architecture compared to its main competitors. With 10.36 M parameters and 19.48 G FLOPs, MixMambaNet is more lightweight and computationally efficient than SCTransNet (11.19 M Params, 20.24 G FLOPs) and significantly more so than UIU-Net (50.54 M Params, 54.42 G FLOPs). This favorable trade-off between accuracy and computational cost makes MixMambaNet a more practical solution for deployment in resource-constrained applications.
4.3.3. AUC Analysis for Robustness
To evaluate the model's performance stability across different decision thresholds, we calculated the Area Under the Curve (AUC) on all three datasets, with results shown in Table 4. A higher AUC value indicates better detection performance that is less sensitive to the choice of a specific segmentation threshold. The results confirm the robustness of our method. MixMambaNet consistently achieves the highest or second-highest AUC scores across all datasets and evaluation criteria, outperforming or performing on par with the previous leading method, SCTransNet. This demonstrates that the superiority of MixMambaNet is not an artifact of a single optimal threshold but reflects a fundamentally stronger feature representation that reliably distinguishes targets from background across a wide operational range.
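Pixel-level AUC can be computed directly from the raw score maps, for instance with scikit-learn (an assumed tool; note that some IRSTD papers instead integrate a Pd-versus-Fa curve over thresholds):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_auc(score_map: np.ndarray, gt: np.ndarray) -> float:
    """Threshold-free ROC AUC over pixels: `score_map` holds the raw
    network outputs, `gt` the binary ground-truth mask."""
    return float(roc_auc_score(gt.reshape(-1).astype(int),
                               score_map.reshape(-1)))
```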
4.3.4. Qualitative Comparison
Figure 7 presents a visual comparison of the detection results produced by the proposed MixMambaNet and four representative infrared small target detection models (ALCNet, ACMNet, UIU-Net, DNANet) under highly challenging scenarios. The selected scenes contain diverse difficulties, including low signal-to-noise ratio (SNR), multiple closely spaced targets, and strong background clutter interference. From the results, ALCNet and ACMNet frequently miss faint targets (blue boxes) or misclassify background noise as targets (yellow boxes). UIU-Net demonstrates partial improvement, but still suffers from missed detections in extremely weak signal cases, as visible in the first and second rows. DNANet achieves cleaner segmentation masks but occasionally fails to suppress clutter, leading to false alarms in complex textures. In contrast, MixMambaNet consistently detects all genuine targets (white boxes) with precise segmentation masks that closely match the ground truth, while maintaining effective suppression of high-frequency background noise. This visual evidence reinforces our quantitative evaluation, confirming that MixMambaNet excels at enhancing target saliency while mitigating interference from challenging infrared backgrounds.
4.3.5. Robustness to Faint Signals and Clutter Suppression
To rigorously evaluate the robustness of MixMambaNet against faint target signals and complex background clutter, we conducted experiments on the aggregated test sets of the NUAA-SIRST, NUDT-SIRST, and IRSTD-1k datasets, stratifying the targets into three size categories: small (<10 pixels), medium (10–20 pixels), and large (>20 pixels). The IoU performance was compared with the current state-of-the-art baseline, SCTransNet.
Table 5 reveals a clear and compelling trend. While MixMambaNet demonstrates superior performance across all target sizes, the most substantial advantage is observed in the Small-target category (<10 pixels), with a significant IoU gain of 1.3%. This performance margin decreases for Medium-sized targets (+0.7%) and becomes minimal for Large, more easily discernible targets (>20 pixels, +0.2%).
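The stratification itself is mechanical; a sketch of the bucketing described above (connected components as targets, thresholds as in Table 5) is shown below:

```python
import numpy as np
from scipy import ndimage

def size_bucket(n_pixels: int) -> str:
    """Buckets from the text: small (<10 px), medium (10-20 px), large (>20 px)."""
    if n_pixels < 10:
        return "small"
    return "medium" if n_pixels <= 20 else "large"

def stratify_targets(gt: np.ndarray) -> list[str]:
    """Assign each ground-truth target (connected component) to a bucket."""
    labels, n = ndimage.label(gt.astype(bool))
    return [size_bucket(int((labels == k).sum())) for k in range(1, n + 1)]
```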
Qualitative Comparison on Cluttered Backgrounds: To visually substantiate our model's superior clutter suppression capability, Figure 8 presents a visual comparison between our proposed MixMambaNet and the baseline model, SCTransNet, on a selection of challenging infrared images. These scenes are chosen to highlight performance under conditions of complex backgrounds, low signal-to-noise ratio, and multiple closely spaced targets. For clarity, we use blue circles to denote missed detections and yellow circles for false alarms.
Row 1 (Building Scene): This scene features multiple small targets positioned near the sharp, high-contrast edges of a building structure. While both models successfully identify all genuine targets, SCTransNet produces a false alarm (indicated by the yellow circle), misinterpreting the strong edge features as a target. In contrast, MixMambaNet achieves flawless detection, accurately identifying all targets with zero false positives. This result highlights our model's superior ability to differentiate true targets from structured background clutter, such as building edges.
Row 2 (Sky Scene with Clustered Targets): In this scenario, a cluster of weak targets appears against a relatively uniform sky background. MixMambaNet once again demonstrates its heightened sensitivity by correctly identifying all targets. The baseline model, SCTransNet, fails to detect the faintest target on the left (indicated by the blue circle), underscoring its limitations in low signal-to-noise ratio (SNR) conditions.
Row 3 (Cloud Clutter Scene): This case presents an exceptionally difficult challenge, with faint targets deeply embedded in diffuse cloud structures. SCTransNet struggles significantly, missing some targets while also producing spurious detections in the noisy cloud regions. Conversely, MixMambaNet performs perfectly, detecting all genuine targets while effectively suppressing the surrounding cloud clutter, resulting in zero missed detections and zero false alarms.
In summary, these qualitative results compellingly demonstrate that MixMambaNet establishes a new state-of-the-art balance between detection sensitivity and clutter suppression. It not only exhibits superior capability in detecting extremely faint and clustered targets but also shows remarkable robustness in rejecting complex background interference, a critical advantage over existing methods.
4.4. Ablation Study
To rigorously validate the effectiveness of each core component within our MixMambaNet architecture, we conducted a series of ablation experiments on the NUDT-SIRST dataset. Starting with a foundational baseline model, we incrementally integrated our three proposed modules: the Perception-aware Hybrid Encoder (PHE), the MixMamba Bottleneck (MMB), and the Non-local Mamba Aggregation (NMA) module. The impact of each addition was evaluated using IoU and nIoU metrics, with the results detailed in Table 6.
Baseline Model: Our baseline model employs SCTransNet as the foundational performance benchmark, achieving an IoU of 81.71%.
Effectiveness of PHE: By replacing the standard encoder with our PHE module, the IoU increased to 81.96%. This confirms the efficacy of the PHE’s parallel structure in capturing a richer, multi-paradigm feature representation at the encoding stage.
Distinct Roles of MMB and NMA: To clarify the unique contributions of the MMB and NMA modules, we analyzed their effects both independently and jointly.
First, we introduced the MMB at the bottleneck of the PHE-enhanced model. This led to a notable performance increase, raising the IoU to 82.28%. This demonstrates the critical role of the MMB in performing global scene disambiguation on the most compressed, high-level features, effectively suppressing complex background clutter before decoding.
Next, to isolate NMA’s function, we tested a configuration with PHE and NMA but without MMB. This model achieved an IoU of 82.45%. The improvement over the “Baseline + PHE” model shows that the NMA is effective at context-aware feature fusion during the decoding phase, intelligently aggregating multi-scale features from skip connections.
However, this “PHE + NMA” configuration is still outperformed by the full model (IoU 82.68%), which includes the MMB. This crucial comparison reveals that while NMA refines feature fusion, it cannot fully compensate for the absence of MMB’s dedicated global context modeling at the bottleneck. The MMB provides a holistic scene understanding that the NMA then leverages to achieve more precise boundary refinement.
Synergistic Effect: The final configuration, representing the complete MixMambaNet (PHE + MMB + NMA), achieved the best overall performance with an IoU of 82.68%. This confirms that the modules are not redundant; rather, they operate synergistically.
In summary, the ablation study systematically demonstrates that each proposed module contributes positively. The MMB excels at global context disambiguation at the bottleneck, while the NMA specializes in leveraging that context for superior feature aggregation during decoding. Their combined, synergistic effect is key to the final performance of MixMambaNet.
5. Discussion and Future Work
5.1. Discussion
Our empirical results demonstrate that MixMambaNet establishes a new state-of-the-art in infrared small target detection, excelling in both accuracy and computational efficiency. The success of our approach is primarily attributed to the strategic integration of State Space Models (SSMs) within a hybrid CNN framework. Unlike traditional CNNs, which are constrained by local receptive fields, or Transformers, which incur quadratic complexity, Mamba captures long-range dependencies with linear complexity. This is particularly well-suited for IRSTD, where a target’s context is defined by its relationship to the entire global background.
The synergy between our proposed modules, the Perception-aware Hybrid Encoder (PHE), the MixMamba Bottleneck (MMB), and the Non-local Mamba Aggregation (NMA), validates our design philosophy. This hierarchical application of Mamba provides a rich feature foundation, models the global context effectively, and intelligently fuses it with fine-grained spatial details. To provide a holistic yet concise perspective, Table 7 systematically summarizes the primary advantages and limitations of MixMambaNet.
5.2. Future Work
The limitations identified in Table 7 directly inform our agenda for future research. We propose to extend this work along the following concrete pathways:
Enhancing Real-World Robustness: To address the current generalization boundaries, our immediate focus will be on collecting and annotating a new, large-scale dataset featuring adverse weather, diverse sensor types, and varying noise profiles. This will enable the development of more robust models through targeted data augmentation and domain generalization techniques.
Optimization for Edge Deployment: To move MixMambaNet from a high-performance model to a deployable one, we will investigate quantization-aware training (QAT) and structured pruning methods tailored for Mamba architectures. The goal is a lightweight version that meets the strict latency and memory requirements of real-time edge computing without significant accuracy loss.
Improving Architectural Interpretability: To demystify the “black box” nature of Mamba’s selective scan mechanism, we aim to develop novel visualization techniques. This research will help trace the influence of input pixels and understand the model’s decision-making process, fostering trust and enabling more targeted architectural refinements.
Addressing Data Scarcity and Extreme Cases: To reduce the dependency on large labeled datasets and improve performance on the most challenging sub-pixel or low-SNR targets, we will explore semi-supervised learning and physics-informed neural network (PINN) concepts. This could embed physical properties of target signatures directly into the model, enhancing its capability in data-scarce and extreme scenarios.