1. Introduction
Infrared small target detection (IRSTD) is a cornerstone technology for ensuring safety and maintaining strategic advantage in high-stakes domains such as national defense, aviation safety, and autonomous systems [1,2,3,4]. The practical urgency of this field is underscored by its critical applications: detecting stealth aircraft or incoming missile threats in military early-warning systems, identifying unmanned aerial vehicles (UAVs) in restricted airspace, and spotting potential obstacles for safe navigation in maritime surveillance and autonomous driving, especially under low-visibility conditions. In these scenarios, targets are often small, faint, and distant, appearing as mere pixel clusters with no discernible shape or texture. They are easily submerged in complex backgrounds containing heavy cloud clutter, sea-sky lines, or urban thermal noise. The failure to reliably detect such targets can have catastrophic consequences, making the development of highly sensitive and robust detection algorithms not just an academic pursuit, but a practical imperative.
As illustrated in Figure 1, infrared small targets in real-world scenarios typically occupy only a few pixels, are embedded in heterogeneous, nonstationary backgrounds, and exhibit low signal-to-clutter ratios (SCRs). The lack of color and texture cues, frequent defocus or motion blur, and pronounced scale imbalance further complicate the problem, causing weak target signals to be easily submerged by structured clutter.
Before the widespread adoption of deep learning, single-frame IRSTD relied largely on model-driven techniques, including spatial-frequency filtering, human visual system (HVS)-inspired contrast mechanisms, and low-rank/sparse decomposition [5,6,7,8]. While these approaches offer interpretability and modest computational cost, their effectiveness hinges on handcrafted priors and narrow operating assumptions. As reported across remote sensing and surveillance studies, robustness degrades when scene statistics drift, clutter changes, or targets deviate from the assumed models, limiting practical deployment [9].
With the emergence of public datasets, IRSTD has increasingly been formulated as a fine-grained segmentation task, where U-shaped encoder–decoder networks with skip connections preserve spatial detail while learning semantic abstractions [10]. Numerous variants enhance cross-layer interaction and feature utilization through asymmetric top-down/bottom-up modulation, dense nested aggregation, and attention-guided fusion [11,12]. Nonetheless, persistent limitations are observed in single-frame pipelines. First, repeated downsampling suppresses weak small-target activations and disrupts hierarchical information flow, impeding the consolidation of fragmentary evidence into coherent semantics [12]. Second, the semantic gap between encoder outputs and decoder inputs is insufficiently bridged by naïve skips or rudimentary fusion, allowing clutter to propagate and diminishing boundary precision for tiny targets [9]. Third, long-range contextual perception remains unreliable in deeper layers: background continuity is under-modeled, target–background similarity is high, and spurious responses elevate false alarms [13].
To alleviate the limited contextual field of CNNs, Transformer-based designs were introduced, but their quadratic computational complexity (O(N²)) creates a significant accuracy-efficiency dilemma, especially for the high-resolution imagery typical of IRSTD. This calls for a new architectural paradigm. The recently proposed Mamba architecture, a State Space Model (SSM) with linear complexity (O(N)), presents a compelling alternative. However, to merely view Mamba as a more efficient substitute would be to overlook its fundamental functional novelty. The Transformer's self-attention performs a static, all-to-all comparison, whereas Mamba operates via a dynamic, sequential scan with content-aware gating. This represents a paradigm shift from static global comparison to dynamic, selective global modeling, a more robust capability for distinguishing faint targets from deceptive backgrounds.
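For reference, this mechanism can be made concrete with the state-space recurrence that Mamba discretizes (shown here in a simplified, Euler-style form consistent with [20]; Δt, Bt, and Ct are generated from the current token xt, which is what makes the scan selective):

```latex
% Per-token recurrence of a selective state-space model (sketch).
% Each step consumes one token and the previous hidden state only,
% so a length-N sequence costs O(N), unlike O(N^2) self-attention.
\begin{aligned}
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,\\
\bar{A}_t &= \exp(\Delta_t A), \qquad \bar{B}_t \approx \Delta_t B_t .
\end{aligned}
```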
However, simply adopting this new paradigm is insufficient. The fundamental challenges of IRSTD demand a systematic architectural response. Specifically, three critical bottlenecks must be addressed: (1) Early-stage signal preservation: weak target signatures are easily lost to signal drowning when standard convolutions uniformly process all pixels. (2) Global scene disambiguation: at the network’s deepest point, limited receptive fields fail to resolve “deceptive clutter” that is locally indistinguishable from targets. (3) Semantic-aware fusion: during upsampling, the semantic gap between high-level features and spatial details allows clutter to propagate through naïve skip connections, diminishing boundary precision.
Guided by these principles, we develop MixMambaNet, a Mamba-enhanced IRSTD framework that systematically addresses each bottleneck through three synergistic modules. First, the Perception-aware Hybrid Encoder (PHE) replaces conventional residual blocks to combat signal drowning by decoupling local perceptual attention from mixed pixel-channel attention, thereby strengthening minute structures while retaining image-wide statistics. Second, the MixMamba Bottleneck (MMB) leverages selective-scan 2D state-space modeling to provide efficient long-range reasoning with linear complexity, resolving global ambiguities that CNNs cannot capture and Transformers address only at prohibitive cost. Third, the Non-local Mamba Aggregation (NMA) module substitutes standard skip fusion to bridge the semantic gap through adaptive, context-aware feature fusion that aligns cross-scale semantics and filters structured clutter. The resulting U-shaped network employs deep supervision for stable optimization and improved discriminability across decoder stages. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1k show consistent gains over prevailing CNN, Transformer, and hybrid approaches, including SCTransNet, with competitive efficiency [9,11,14,15].
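To make the data flow concrete, the following PyTorch sketch mirrors the U-shaped pipeline described above. The PHE, MMB, and NMA internals are deliberately replaced with nn.Identity placeholders: this illustrates only the routing (encode with skips, global bottleneck, context-aware fusion with deep supervision), not the authors' module implementations.

```python
import torch
import torch.nn as nn

class MixMambaNetSketch(nn.Module):
    """Routing sketch only: nn.Identity stands in for PHE/MMB/NMA."""

    def __init__(self, stages: int = 4, ch: int = 32):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 3, padding=1)
        self.phe = nn.ModuleList(nn.Identity() for _ in range(stages))  # encoder (PHE)
        self.down = nn.ModuleList(nn.MaxPool2d(2) for _ in range(stages))
        self.mmb = nn.Identity()                                        # bottleneck (MMB)
        self.nma = nn.ModuleList(nn.Identity() for _ in range(stages))  # skip fusion (NMA)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.heads = nn.ModuleList(nn.Conv2d(ch, 1, 1) for _ in range(stages))

    def forward(self, x):
        x, skips = self.stem(x), []
        for phe, down in zip(self.phe, self.down):   # encode and keep skips
            x = phe(x)
            skips.append(x)
            x = down(x)
        x = self.mmb(x)                              # global disambiguation
        outs = []
        for nma, head, skip in zip(self.nma, self.heads, reversed(skips)):
            x = nma(self.up(x) + skip)               # fuse context with detail
            outs.append(head(x))                     # deep-supervision output
        return outs                                  # one map per decoder stage

maps = MixMambaNetSketch()(torch.randn(1, 1, 256, 256))  # 4 multi-scale maps
```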
The contributions of this work are summarized as follows:
Conventional residual blocks are replaced with a perception-aware hybrid encoder, where local perceptual attention is coupled with mixed pixel–channel attention in multi-branch paths so that fine, low-contrast structures are amplified while global context is injected early to suppress structured distractors.
Dense pre-enhancement is integrated with a selective-scan 2D state-space core (Mamba) and a lightweight hybrid-attention tail to realize linear-complexity long-range reasoning that is better matched to the weak-signal characteristics of IRSTD than quadratic self-attention.
Standard skip fusion is replaced with a non-local Mamba aggregation module, in which DASI-style multi-scale integration, SS2D-driven selective scanning, and adaptive non-local enhancement are combined to align cross-scale semantics and to model background continuity, thereby reducing spurious responses in cluttered regions.
2. Related Work
The detection of infrared small targets (IRSTD) has evolved significantly, transitioning from traditional model-driven methods to advanced data-driven deep learning paradigms. This section reviews this evolution, critically analyzing the limitations of existing approaches and contextualizing the architectural innovations of our proposed MixMambaNet.
2.1. Traditional Model-Driven Methods
Early IRSTD research was dominated by model-driven approaches that leveraged hand-crafted features derived from the assumed physical properties of targets. These methods can be broadly classified into filtering-based and Human Visual System (HVS) inspired techniques. Filtering-based methods, such as the classic Top-Hat transform [16], operate by subtracting a morphologically opened (background) image from the original, thereby isolating bright regions. While straightforward, their efficacy is critically dependent on the size and shape of the structuring element, which struggles to adapt to variations in target scale and background complexity, often leading to significant background residuals.
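As a concrete illustration, a minimal top-hat filter fits in a few lines (a sketch using SciPy's grey-scale morphology; the structuring-element size is an illustrative choice, not a value from the cited work):

```python
import numpy as np
from scipy import ndimage

def top_hat(img: np.ndarray, se_size: int = 5) -> np.ndarray:
    """White top-hat: original minus its morphological opening, which
    keeps bright structures smaller than the structuring element."""
    return ndimage.white_tophat(img, size=(se_size, se_size))
```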
Inspired by how humans perceive salient objects, HVS-based methods like the Local Contrast Method (LCM) [7] and its numerous variants were developed. These methods compute a saliency map based on the contrast between a central patch and its surrounding pixels. While computationally efficient and intuitive, their reliance on local statistics makes them highly susceptible to false alarms triggered by strong edges, corner points, or even pixel-level noise that mimics high local contrast. The fundamental deficiency uniting this entire class of traditional methods is their reliance on fixed, low-level priors. They lack the capacity to learn from data, rendering them brittle and unable to generalize across the diverse and dynamic scenes encountered in real-world applications.
2.2. CNN-Based Methods
The advent of deep learning, particularly Convolutional Neural Networks (CNNs), marked a paradigm shift in IRSTD. By learning hierarchical feature representations directly from data, CNN-based methods have consistently outperformed their traditional counterparts. Seminal works like ACM [11] and ALCNet [17] established the effectiveness of the U-Net-style encoder–decoder architecture, often enhanced with attention mechanisms or contextual modules to refine feature maps. Subsequent developments, including DNA-Net [12] and UIU-Net [14], introduced more intricate feature fusion strategies and interactive connections to bridge the semantic gap between the encoder and decoder paths.
Despite these advancements, a fundamental and inherent limitation persists across all CNN-based architectures: the locality of the convolution operation. As information propagates through deeper layers, the effective receptive field, while growing, remains fundamentally constrained. This makes it challenging for the network to model the long-range spatial dependencies necessary to perform global scene disambiguation. For instance, a CNN may struggle to differentiate a true, isolated target from a small, bright patch that is part of a larger, distant clutter structure (e.g., a building’s window frame). This locality constraint is the primary cause of “deceptive clutter” false alarms and is a core challenge that our work aims to address.
2.3. Advanced Methods
To overcome the inherent locality of CNNs, a significant body of recent research has focused on designing architectures capable of capturing global context. This pursuit has primarily branched into two main avenues: integrating Transformer-based modules and exploring other non-local modeling paradigms.
Transformer-based Hybrids: This is currently the most prominent approach. Methods like SCTransNet [9] incorporate blocks from hierarchical Transformers (e.g., Swin Transformer) into the U-Net backbone. The self-attention mechanism within these blocks allows the model to compute relationships between all feature tokens, thereby achieving a global receptive field. Similarly, TransUNet [18] demonstrates that combining a Transformer-based encoder for global context with a CNN-based decoder for precise localization can yield strong performance in segmentation tasks. While this hybrid strategy has proven effective for reducing false alarms caused by structured, non-local clutter, it comes at a steep price: the quadratic computational complexity (O(N²)) of the self-attention mechanism with respect to the number of image tokens. This makes pure or heavy Transformer models inefficient, particularly for high-resolution infrared images, hindering their practical deployment in resource-constrained or real-time scenarios.
Other Non-Local Modeling Paradigms: Beyond Transformers, researchers have explored other ways to model long-range dependencies. For instance, Non-local Neural Networks [19] introduce a generic "non-local" block that computes the response at a position as a weighted sum of features at all positions. While this successfully captures global dependencies, much like a single self-attention layer, it also suffers from high computational cost and can be challenging to integrate efficiently. More recently, new paradigms continue to emerge. For example, MLEDNet [1] proposes a dense nested network that assists detection with multi-directional learnable edge information. While innovative, its primary focus on local edge priors may not fully resolve the challenge of suppressing large-scale, non-edge-based clutter, which requires a more holistic scene understanding.
A common thread connects all these advanced methods: a persistent and challenging trade-off between the ability to model global context and the demand for computational efficiency. CNNs are efficient but local; Transformers and non-local blocks are global but computationally expensive. This creates a critical and unmet need for an architecture that can achieve comprehensive global context modeling with high computational efficiency.
2.4. The Emergence of Mamba-Based Methods
The recent introduction of State Space Models (SSMs), particularly the Mamba architecture [20], has presented a highly promising solution to the efficiency-performance trade-off. With their ability to model long-range dependencies in linear time, Mamba-based models are rapidly emerging as a compelling alternative to Transformers in the IRSTD field.
A notable contemporary work in this direction is MOU-Mamba [21], which proposes a Multi-Order U-shape Mamba architecture. Its core innovation lies in redesigning the Mamba block itself to be more adept at visual tasks, introducing a Multi-Order 2D-Selective-Scan (MO-SS2D) module to capture dependencies at various scales. This approach focuses on enhancing the internal capabilities of the Mamba block, making it a more powerful, self-contained unit for feature extraction.
This design philosophy, however, presents a clear architectural trade-off. By tasking a single, complex block with handling both local and global feature processing, it potentially overlooks the distinct advantages and proven efficiency of specialized modules. The attempt to create a versatile, “all-in-one” Mamba block leads to a less direct architectural strategy, where the inherent strengths of different computational paradigms, such as the local feature expertise of CNNs, are not explicitly leveraged. This points to an alternative and potentially more efficient architectural direction: a synergistic framework where different modules collaborate based on their specialized strengths.
Our proposed MixMambaNet directly addresses this architectural question by opting for a synergistic, hybrid approach. Instead of modifying the core Mamba block, we leverage its primary strength, efficient global modeling, and integrate it within a carefully designed framework where it collaborates with CNNs. This clear division of labor is our core advantage. CNNs handle local features: we retain convolutions for what they do best, efficient and robust extraction of local spatial features. Mamba handles global context: we strategically deploy Mamba modules at critical points in the network (PHE, MMB, NMA) to establish long-range dependencies and perform global scene disambiguation. This architectural philosophy of synergistic delegation, rather than monolithic block enhancement, allows MixMambaNet to achieve a superior balance of performance, parameter efficiency (10.36 M vs. MOU-Mamba's 13.44 M), and architectural clarity, establishing a new state-of-the-art.
4. Experiments
4.1. Evaluation Metrics
To comprehensively evaluate the performance of our proposed model and conduct a fair comparison with state-of-the-art methods, we employ a set of widely recognized metrics. These metrics are categorized into two groups: those assessing detection and segmentation accuracy, and those evaluating model complexity and computational efficiency.
4.1.1. Accuracy and Detection Performance Metrics
Intersection over Union (IoU): As a primary metric for segmentation quality, IoU quantifies the spatial overlap between the predicted segmentation map (P) and the ground-truth mask (G). It is calculated as the ratio of the area of their intersection to the area of their union. A higher IoU score signifies a more precise alignment of the predicted target shape with the actual target. It is formally defined as follows:
IoU = TP / (TP + FP + FN),
where TP, FP, and FN represent the counts of true-positive, false-positive, and false-negative pixels, respectively.
Normalized Intersection over Union (nIoU): To mitigate the potential bias in the standard IoU metric [11], where datasets containing targets of varying sizes might disproportionately influence the average score, we also adopt the nIoU. This metric calculates the IoU for each target individually and then averages these scores across all targets present in the dataset. This ensures that each target, regardless of its size, contributes equally to the final performance score.
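A minimal sketch of both metrics on binary masks follows; connected components stand in for targets, and the per-target matching rule (compare each ground-truth target against the union of predicted blobs touching it) is one plausible reading of the protocol in [11], not a verbatim copy of any released evaluation code:

```python
import numpy as np
from scipy import ndimage

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level IoU = TP / (TP + FP + FN) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def niou(pred: np.ndarray, gt: np.ndarray) -> float:
    """nIoU: IoU computed per ground-truth target, then averaged, so
    every target contributes equally regardless of its size."""
    pred = pred.astype(bool)
    gt_lab, n_gt = ndimage.label(gt.astype(bool))
    pr_lab, _ = ndimage.label(pred)
    scores = []
    for k in range(1, n_gt + 1):
        target = gt_lab == k
        hits = np.unique(pr_lab[target])           # predicted blobs touching it
        matched = np.isin(pr_lab, hits[hits > 0])  # union of those blobs
        scores.append(iou(matched, target))
    return float(np.mean(scores)) if scores else 1.0
```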
F-measure: The F-measure provides a balanced assessment of a model's performance by computing the harmonic mean of Precision and Recall. Precision measures the accuracy of the positive predictions (i.e., the proportion of correctly identified target pixels among all pixels predicted as targets), while Recall measures the model's ability to identify all actual target pixels. The F-measure is particularly useful in scenarios with significant class imbalance, a common characteristic of small target detection. The formulas are as follows:
Precision = TP / (TP + FP), Recall = TP / (TP + FN),
F-measure = 2 × Precision × Recall / (Precision + Recall).
Probability of Detection (Pd): This metric evaluates the model's efficacy at the target level rather than the pixel level. It represents the ratio of the number of correctly detected targets (N_correct) to the total number of actual targets (N_all) in the dataset:
Pd = N_correct / N_all.
A target is considered correctly detected if the centroid deviation between its predicted segmentation and the ground-truth mask is within a predefined pixel threshold, set to 3 pixels following comparable work [9,12].
False-Alarm Rate (Fa): The False-Alarm Rate is a critical indicator of a model's robustness against background clutter. It measures the proportion of background pixels that are incorrectly classified as target pixels (P_false) relative to the total number of pixels (P_all) in the entire image:
Fa = P_false / P_all.
A lower Fa value indicates a stronger capability to suppress background noise and reduce spurious detections.
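For completeness, a sketch of both target-level Pd and pixel-level Fa for a single image pair is shown below, directly following the definitions above; the centroid-matching loop uses the 3-pixel rule, and published evaluation scripts may differ in edge-case handling:

```python
import numpy as np
from scipy import ndimage

def pd_fa(pred: np.ndarray, gt: np.ndarray, dist: float = 3.0):
    """Pd: fraction of GT targets whose centroid lies within `dist`
    pixels of some predicted blob's centroid. Fa: false-positive
    pixels divided by the total pixel count."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    gt_lab, n_gt = ndimage.label(gt)
    pr_lab, n_pr = ndimage.label(pred)
    gt_c = ndimage.center_of_mass(gt, gt_lab, range(1, n_gt + 1))
    pr_c = ndimage.center_of_mass(pred, pr_lab, range(1, n_pr + 1))
    detected = 0
    for gy, gx in gt_c:                      # match by centroid deviation
        if any((gy - py) ** 2 + (gx - px) ** 2 <= dist ** 2
               for py, px in pr_c):
            detected += 1
    pd = detected / n_gt if n_gt else 1.0
    fa = np.logical_and(pred, ~gt).sum() / pred.size  # FP pixels / all pixels
    return pd, float(fa)
```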
4.1.2. Model Complexity and Efficiency Metrics
Parameters (Params): This metric refers to the total number of learnable parameters (i.e., weights and biases) within the network. It serves as an indicator of the model’s size and static memory requirements. The number of parameters is typically reported in millions (M).
Floating Point Operations (FLOPs): FLOPs quantify the computational complexity of a model. This metric represents the total count of floating-point arithmetic operations required to process a single input image in a forward pass. It is a hardware-independent measure of the model’s theoretical inference speed, commonly expressed in GigaFLOPs (G).
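Both quantities are straightforward to obtain in PyTorch. Parameter counting needs no external tooling; FLOPs are typically read off a profiler (the thop usage in the comment is one common option, mentioned here as an assumption rather than as the authors' tooling):

```python
import torch
import torch.nn as nn

def params_millions(model: nn.Module) -> float:
    """Learnable parameters (weights and biases), reported in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# FLOPs via a profiler, e.g. the `thop` package:
#   from thop import profile
#   macs, _ = profile(model, inputs=(torch.randn(1, 1, 256, 256),))
# thop counts multiply-accumulates; conventions differ on whether
# reported GFLOPs means MACs or 2 * MACs, so the choice should be stated.
print(params_millions(nn.Conv2d(1, 32, 3)))  # tiny example: 0.00032 M
```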
4.2. Experiment Settings
Datasets. Our empirical evaluation is conducted on three widely used public benchmarks for infrared small target detection: NUAA-SIRST [11], NUDT-SIRST [12], and IRSTD-1k [26], which contain 427, 1327, and 1000 images, respectively. To ensure a fair and reproducible comparison, we adhere to the standard data partitioning protocols established in their original publications. Specifically, the training and test splits for NUAA-SIRST and NUDT-SIRST follow the methodology proposed by Dai et al., while the splits for IRSTD-1k are based on the work of Wang et al.
Implementation Details. All experiments were conducted within the PyTorch 2.1 deep learning framework. The model was trained and evaluated on a workstation equipped with a single NVIDIA GeForce RTX 3090 GPU, an Intel Core i7-12700KF CPU, and 32 GB of RAM. The model was trained from scratch, without reliance on any pre-trained weights. For data preparation, all images across the datasets were first normalized to a pixel value range of [0, 1]. Subsequently, to ensure a uniform input size for the network, random cropping was applied to generate patches of 256 × 256 pixels. To enhance model robustness and mitigate the risk of overfitting, we employed a standard online data augmentation strategy comprising random horizontal flipping (with a 50% probability) and random rotations. For network optimization, we utilized the Adam optimizer with an initial learning rate of 1 × 10⁻³. To facilitate stable convergence and fine-tuning in later stages of training, a Cosine Annealing scheduler progressively decayed the learning rate to a minimum of 1 × 10⁻⁵ over the course of training. The model was trained with a batch size of 16 for a fixed 1000 epochs. After each epoch, the model's performance was evaluated on a validation set, and the weights achieving the best validation performance were saved; this best-performing model was then used for the final evaluation on the test set. The entire training procedure for our model required approximately 23 h to complete.
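The optimization recipe above can be reproduced almost verbatim in PyTorch; in this sketch, a one-layer convolution and random tensors stand in for MixMambaNet and the SIRST training data, and validation-based checkpointing is left as a comment:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real model and data (illustrative only).
model = nn.Conv2d(1, 1, 3, padding=1)
criterion = nn.BCEWithLogitsLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 1, 256, 256),                # 256x256 crops
                  torch.randint(0, 2, (64, 1, 256, 256)).float()),
    batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-5)

for epoch in range(1000):                                      # fixed 1000 epochs
    for images, masks in train_loader:
        loss = criterion(model(images), masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                            # lr: 1e-3 -> 1e-5
    # evaluate on the validation set here; keep the best checkpoint
```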
Baselines. To rigorously assess the efficacy of our method, MixMambaNet is benchmarked against a comprehensive suite of state-of-the-art (SOTA) infrared small target detection algorithms. The comparison includes nine prominent learning-based methods: ACM [11], ALCNet [17], RDIAN [27], DNANet [12], ISTDU-Net [28], UIU-Net [14], MTU-Net [15], MOU-Mamba [21], and SCTransNet [9]. Results from our reproduction of MOU-Mamba are designated MOU-Mamba (re).
4.3. Quantitative Results
In this section, we present a comprehensive quantitative evaluation of MixMambaNet against a range of state-of-the-art (SOTA) methods. The analysis is conducted across three public datasets to demonstrate the effectiveness, robustness, and efficiency of our proposed architecture.
4.3.1. Performance on Individual Datasets
Table 1 and Table 2 provide a detailed comparison of performance metrics on the NUAA-SIRST, NUDT-SIRST, and IRSTD-1k datasets, with a visual summary in Figure 6. Our proposed MixMambaNet demonstrates consistently superior or highly competitive performance across all benchmarks.
On the NUAA-SIRST dataset, MixMambaNet achieves the highest scores in mIoU (77.66%), nIoU (82.00%), and Probability of Detection (Pd) at 97.24%. This indicates a remarkable capability in both accurately segmenting the target shape and successfully identifying its presence. While its F-measure (87.17%) is marginally second to SCTransNet (87.32%), its superior IoU scores highlight a more precise pixel-level prediction.
On the NUDT-SIRST dataset, which is larger and contains more complex scenarios, MixMambaNet establishes a new state-of-the-art by a significant margin. It surpasses all other methods across all four key metrics, achieving an mIoU of 95.33%, nIoU of 95.68%, F-measure of 97.17%, and a Pd of 98.77%. This dominant performance underscores the model’s exceptional robustness and its ability to handle diverse target and background variations.
On the challenging IRSTD-1k dataset, our model continues to exhibit strong generalization capabilities. It obtains the best mIoU (69.18%) and nIoU (69.32%), demonstrating superior segmentation accuracy. Furthermore, it achieves a highly competitive F-measure of 80.63% and maintains a low False-Alarm Rate (Fa) of 11.23, second only to SCTransNet. This result validates the effectiveness of MixMambaNet in complex and cluttered real-world environments.
Across all datasets, traditional methods like Top-Hat and WSLCM consistently yield low IoU and F-measure scores, confirming the significant advantage of deep learning-based approaches. Among these, MixMambaNet consistently positions itself as a top-performing model, validating the efficacy of its architectural design.
4.3.2. Comprehensive and Efficiency Analysis
To provide a holistic view of performance and model complexity, Table 3 presents the average metrics across all datasets, alongside the number of parameters and computational cost (FLOPs). MixMambaNet not only achieves the highest overall accuracy but also demonstrates an excellent balance between performance and efficiency. It records the best average IoU (84.18%), nIoU (87.11%), and F-measure (91.04%).
Crucially, it achieves this superior performance with a more streamlined architecture compared to its main competitors. With 10.36 M parameters and 19.48 G FLOPs, MixMambaNet is more lightweight and computationally efficient than SCTransNet (11.19 M Params, 20.24 G FLOPs) and significantly more so than UIU-Net (50.54 M Params, 54.42 G FLOPs). This favorable trade-off between accuracy and computational cost makes MixMambaNet a more practical solution for deployment in resource-constrained applications.
4.3.3. AUC Analysis for Robustness
To evaluate the model's performance stability across different decision thresholds, we calculated the Area Under the Curve (AUC) on all three datasets, with results shown in Table 4. A higher AUC value indicates better detection performance that is less sensitive to the choice of a specific segmentation threshold. The results confirm the robustness of our method. MixMambaNet consistently achieves the highest or second-highest AUC scores across all datasets and evaluation criteria, outperforming or performing on par with the previous leading method, SCTransNet. This demonstrates that the superiority of MixMambaNet is not an artifact of a single optimal threshold but reflects a fundamentally stronger feature representation that reliably distinguishes targets from background across a wide operational range.
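Pixel-level AUC can be computed directly from the raw score maps, for instance with scikit-learn (an assumed tool; note that some IRSTD papers instead integrate a Pd-versus-Fa curve over thresholds):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_auc(score_map: np.ndarray, gt: np.ndarray) -> float:
    """Threshold-free ROC AUC over pixels: `score_map` holds the raw
    network outputs, `gt` the binary ground-truth mask."""
    return float(roc_auc_score(gt.reshape(-1).astype(int),
                               score_map.reshape(-1)))
```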
4.3.4. Qualitative Comparison
Figure 7 presents a visual comparison of the detection results produced by the proposed MixMambaNet and four representative infrared small target detection models (ALCNet, ACMNet, UIU-Net, DNANet) under highly challenging scenarios. The selected scenes contain diverse difficulties, including low signal-to-noise ratio (SNR), multiple closely spaced targets, and strong background clutter interference. From the results, ALCNet and ACMNet frequently miss faint targets (blue boxes) or misclassify background noise as targets (yellow boxes). UIU-Net demonstrates partial improvement, but still suffers from missed detections in extremely weak signal cases, as visible in the first and second rows. DNANet achieves cleaner segmentation masks but occasionally fails to suppress clutter, leading to false alarms in complex textures. In contrast, MixMambaNet consistently detects all genuine targets (white boxes) with precise segmentation masks that closely match the ground truth, while maintaining effective suppression of high-frequency background noise. This visual evidence reinforces our quantitative evaluation, confirming that MixMambaNet excels at enhancing target saliency while mitigating interference from challenging infrared backgrounds.
4.3.5. Robustness to Faint Signals and Clutter Suppression
To rigorously evaluate the robustness of MixMambaNet against faint target signals and complex background clutter, we conducted experiments on the aggregated test sets of the NUAA-SIRST, NUDT-SIRST, and IRSTD-1k datasets, stratifying the targets into three size categories: small (<10 pixels), medium (10–20 pixels), and large (>20 pixels). The IoU performance was compared with the current state-of-the-art baseline, SCTransNet.
Table 5 reveals a clear and compelling trend. While MixMambaNet demonstrates superior performance across all target sizes, the most substantial advantage is observed in the Small-target category (<10 pixels), with a significant IoU gain of 1.3%. This performance margin decreases for Medium-sized targets (+0.7%) and becomes minimal for Large, more easily discernible targets (>20 pixels, +0.2%).
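The stratification itself is mechanical; a sketch of the bucketing described above (connected components as targets, thresholds as in Table 5) is shown below:

```python
import numpy as np
from scipy import ndimage

def size_bucket(n_pixels: int) -> str:
    """Buckets from the text: small (<10 px), medium (10-20 px), large (>20 px)."""
    if n_pixels < 10:
        return "small"
    return "medium" if n_pixels <= 20 else "large"

def stratify_targets(gt: np.ndarray) -> list[str]:
    """Assign each ground-truth target (connected component) to a bucket."""
    labels, n = ndimage.label(gt.astype(bool))
    return [size_bucket(int((labels == k).sum())) for k in range(1, n + 1)]
```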
Qualitative Comparison on Cluttered Backgrounds: To visually substantiate our model's superior clutter suppression capability, Figure 8 presents a visual comparison between our proposed MixMambaNet and the baseline model, SCTransNet, on a selection of challenging infrared images. These scenes are chosen to highlight performance under conditions of complex backgrounds, low signal-to-noise ratio, and multiple closely spaced targets. For clarity, we use blue circles to denote missed detections and yellow circles for false alarms.
Row 1 (Building Scene): This scene features multiple small targets positioned near the sharp, high-contrast edges of a building structure. While both models successfully identify all genuine targets, SCTransNet produces a false alarm (indicated by the yellow circle), misinterpreting the strong edge features as a target. In contrast, MixMambaNet achieves flawless detection, accurately identifying all targets with zero false positives. This result highlights our model's superior ability to differentiate true targets from structured background clutter, such as building edges.
Row 2 (Sky Scene with Clustered Targets): In this scenario, a cluster of weak targets appears against a relatively uniform sky background. MixMambaNet once again demonstrates its heightened sensitivity by correctly identifying all targets. The baseline model, SCTransNet, fails to detect the faintest target on the left (indicated by the blue circle), underscoring its limitations in low signal-to-noise ratio (SNR) conditions.
Row 3 (Cloud Clutter Scene): This case presents an exceptionally difficult challenge, with faint targets deeply embedded in diffuse cloud structures. SCTransNet struggles significantly, missing some targets while also producing spurious detections in the noisy cloud regions. Conversely, MixMambaNet performs perfectly, detecting all genuine targets while effectively suppressing the surrounding cloud clutter, resulting in zero missed detections and zero false alarms.
In summary, these qualitative results compellingly demonstrate that MixMambaNet establishes a new state-of-the-art balance between detection sensitivity and clutter suppression. It not only exhibits superior capability in detecting extremely faint and clustered targets but also shows remarkable robustness in rejecting complex background interference, a critical advantage over existing methods.
4.4. Ablation Study
To rigorously validate the effectiveness of each core component within our MixMambaNet architecture, we conducted a series of ablation experiments on the NUDT-SIRST dataset. Starting with a foundational baseline model, we incrementally integrated our three proposed modules: the Perception-aware Hybrid Encoder (PHE), the MixMamba Bottleneck (MMB), and the Non-local Mamba Aggregation (NMA) module. The impact of each addition was evaluated using IoU and nIoU metrics, with the results detailed in Table 6.
Baseline Model: Our baseline model employs SCTransNet as the foundational performance benchmark, achieving an IoU of 81.71%.
Effectiveness of PHE: By replacing the standard encoder with our PHE module, the IoU increased to 81.96%. This confirms the efficacy of the PHE’s parallel structure in capturing a richer, multi-paradigm feature representation at the encoding stage.
Distinct Roles of MMB and NMA: To clarify the unique contributions of the MMB and NMA modules, we analyzed their effects both independently and jointly.
First, we introduced the MMB at the bottleneck of the PHE-enhanced model. This led to a notable performance increase, raising the IoU to 82.28%. This demonstrates the critical role of the MMB in performing global scene disambiguation on the most compressed, high-level features, effectively suppressing complex background clutter before decoding.
Next, to isolate NMA’s function, we tested a configuration with PHE and NMA but without MMB. This model achieved an IoU of 82.45%. The improvement over the “Baseline + PHE” model shows that the NMA is effective at context-aware feature fusion during the decoding phase, intelligently aggregating multi-scale features from skip connections.
However, this “PHE + NMA” configuration is still outperformed by the full model (IoU 82.68%), which includes the MMB. This crucial comparison reveals that while NMA refines feature fusion, it cannot fully compensate for the absence of MMB’s dedicated global context modeling at the bottleneck. The MMB provides a holistic scene understanding that the NMA then leverages to achieve more precise boundary refinement.
Synergistic Effect: The final configuration, representing the complete MixMambaNet (PHE + MMB + NMA), achieved the best overall performance with an IoU of 82.68%. This confirms that the modules are not redundant; rather, they operate synergistically.
In summary, the ablation study systematically demonstrates that each proposed module contributes positively. The MMB excels at global context disambiguation at the bottleneck, while the NMA specializes in leveraging that context for superior feature aggregation during decoding. Their combined, synergistic effect is key to the final performance of MixMambaNet.
5. Discussion and Future Work
5.1. Discussion
Our empirical results demonstrate that MixMambaNet establishes a new state-of-the-art in infrared small target detection, excelling in both accuracy and computational efficiency. The success of our approach is primarily attributed to the strategic integration of State Space Models (SSMs) within a hybrid CNN framework. Unlike traditional CNNs, which are constrained by local receptive fields, or Transformers, which incur quadratic complexity, Mamba captures long-range dependencies with linear complexity. This is particularly well-suited for IRSTD, where a target’s context is defined by its relationship to the entire global background.
The synergy between our proposed modules, the Perception-aware Hybrid Encoder (PHE), the MixMamba Bottleneck (MMB), and the Non-local Mamba Aggregation (NMA), validates our design philosophy. This hierarchical application of Mamba provides a rich feature foundation, models the global context effectively, and intelligently fuses it with fine-grained spatial details. To provide a holistic yet concise perspective, Table 7 systematically summarizes the primary advantages and limitations of MixMambaNet.
5.2. Future Work
The limitations identified in Table 7 directly inform our agenda for future research. We propose to extend this work along the following concrete pathways:
Enhancing Real-World Robustness: To address the current generalization boundaries, our immediate focus will be on collecting and annotating a new, large-scale dataset featuring adverse weather, diverse sensor types, and varying noise profiles. This will enable the development of more robust models through targeted data augmentation and domain generalization techniques.
Optimization for Edge Deployment: To move MixMambaNet from a high-performance model to a deployable one, we will investigate quantization-aware training (QAT) and structured pruning methods tailored for Mamba architectures. The goal is a lightweight version that meets the strict latency and memory requirements of real-time edge computing without significant accuracy loss.
Improving Architectural Interpretability: To demystify the “black box” nature of Mamba’s selective scan mechanism, we aim to develop novel visualization techniques. This research will help trace the influence of input pixels and understand the model’s decision-making process, fostering trust and enabling more targeted architectural refinements.
Addressing Data Scarcity and Extreme Cases: To reduce the dependency on large labeled datasets and improve performance on the most challenging sub-pixel or low-SNR targets, we will explore semi-supervised learning and physics-informed neural network (PINN) concepts. This could embed physical properties of target signatures directly into the model, enhancing its capability in data-scarce and extreme scenarios.