Article

A Wide and Shallow Network Tailored for Infrared Small Target Detection

1 State Key Laboratory of Optical Field Manipulation Science and Technology, Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
2 Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Chengdu 610209, China
3 University of Chinese Academy of Sciences, Beijing 101408, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(2), 307; https://doi.org/10.3390/rs18020307
Submission received: 25 November 2025 / Revised: 8 January 2026 / Accepted: 12 January 2026 / Published: 16 January 2026

Highlights

What are the main findings?
  • Extremely Lightweight Model: WSNet achieves state–of–the–art efficiency with only 0.054 M parameters and 1.050 G FLOPs, making it the lightest model to date in the field of Infrared Small Target Detection (IRSTD).
  • Wide and Shallow Architecture: Contrary to conventional deep networks, WSNet adopts a wide and shallow design, which is more suitable for infrared images that lack rich semantic information. Excessive depth leads to performance degradation in IRSTD.
  • Superior Performance–Speed Trade–off: WSNet achieves competitive detection accuracy (e.g., highest IoU on SIRST, and best Pd on NUDT–SIRST) while offering the fastest inference speed (up to 146 FPS on GPU, 30 FPS on CPU).
What is the implication of the main finding?
  • Practical Deployment in Resource–Limited Environments: WSNet’s lightweight design and real–time CPU compatibility enable its deployment in embedded systems, drones, and portable infrared devices, where computational resources are limited but low–latency detection is critical.
  • Paradigm Shift in IRSTD Architecture Design: The success of a wide and shallow network challenges the prevailing “deeper is better” assumption in deep learning for IRSTD, encouraging the community to reconsider architecture tailoring based on domain–specific characteristics.

Abstract

Designing lightweight yet competitive models remains a challenging problem across the computer vision community—Infrared Small Target Detection (IRSTD) is no exception. To address this challenge, we propose WSNet, a novel model that achieves competitive performance while significantly reducing computational cost and memory consumption, without relying on deeper architectures or complex fusion mechanisms. The core innovation of WSNet lies in its extremely simple yet highly efficient network architecture, tailored to the specific demands of the IRSTD task. To the best of our knowledge, WSNet is the lightest existing model in the IRSTD field, containing only 0.054 M parameters—hundreds of times fewer than state–of–the–art alternatives—and requiring merely 1.050 G FLOPs. Extensive experiments on multiple benchmark datasets show that WSNet not only performs on par with leading methods but also delivers substantially faster inference speeds, making it highly suitable for real–time applications on embedded and resource–constrained devices.

1. Introduction

Infrared imaging offers unique advantages over radar and visible light imaging, such as strong anti–interference capabilities, smoke penetrability, and adaptability to both day and night scenes [1,2]. As a critical downstream task, Infrared Small Target Detection (IRSTD) has become a key focus in various infrared applications. IRSTD plays an essential role in diverse fields, including traffic management, maritime rescue, and military operations [3,4,5]. However, due to long–distance imaging and significant interference from complex backgrounds, targets in infrared images are often very small and dim, typically covering less than 30 × 30 pixels [6]. Furthermore, thermal imagers capture only thermal radiation, lacking sufficient hue and texture information needed for effective feature extraction, making Infrared Small Target Detection especially challenging.
Traditional methods, such as those cited in [7,8,9,10,11,12,13,14,15,16,17], have made significant historical contributions to the field. However, they are fundamentally limited by their reliance on handcrafted features and conventional image processing techniques, which inherently lack the adaptive generalization capacity needed for robust performance in complex real–world environments.
In recent years, deep learning–based methods have achieved notable progress in Infrared Small Target Detection (IRSTD). Unlike traditional approaches, deep learning models can automatically extract target–related features and contextual cues from the background, thereby significantly improving generalization. Representative works include those cited in [6,18,19,20,21,22,23,24,25,26,27,28,29,30]. However, as shown in Figure 1, these models often require substantial computational and memory resources, limiting their practicality for real–world deployment. This inefficiency primarily stems from the adoption of very deep network architectures inherited from conventional vision tasks, which overlook a key characteristic of IRSTD: infrared targets and their surroundings typically contain limited semantic information, as illustrated in Figure 2. This property reduces the necessity for deep hierarchies and suggests that more efficient designs are possible without sacrificing performance.
Motivated by this observation, we propose WSNet—a lightweight network that diverges from common architectural paradigms in IRSTD, such as U-Nets, ResNets, and Vision Transformers. Instead of pursuing greater depth, our design emphasizes a shallow yet wide structure to enhance representational capacity efficiently. Moreover, we introduce a streamlined building block that replaces conventional concatenation and complex fusion modules with simple yet effective summation and multiplication operations. This design offers a new perspective for efficient model development in IRSTD and may inspire further research into lightweight detection architectures.
The contributions of this study are as follows:
(1)
We design an extremely simple and efficient architecture named WSNet, which contains only 0.054 M parameters and requires 1.050 G FLOPs. It represents the most lightweight model in the field of IRSTD and achieves the fastest inference speed to date. The code is available at https://github.com/CPaul33/WSNet (accessed on 15 January 2025).
(2)
To accommodate the scarcity of hierarchical semantic content in IRSTD tasks, we introduce a Width Extension Module (WEM) that enhances feature representation by expanding the network’s width. Furthermore, building upon the CBAM module [32], we propose a customized Channel–Spatial Hybrid Attention (CSHA) mechanism. This is the first work to incorporate Lp Pooling into such a structure, which effectively smooths outliers while preserving the overall trend of features.
(3)
Extensive experiments on multiple benchmark datasets—SIRST [19], NUDT–SIRST [20], and IRSTD–1K [21]—show that WSNet achieves performance comparable to state–of–the–art models, while delivering the fastest inference speed, several times faster than current SOTA approaches. Furthermore, experiments demonstrate that WSNet can be directly deployed on resource–constrained devices such as CPUs while still maintaining real–time detection capability, making it highly suitable for practical real–time applications and large–scale deployment in real–world IRSTD scenarios.

2. Related Work

2.1. Infrared Small Target Detection

The study of IRSTD has evolved significantly over several decades. In its early stages, traditional approaches predominantly utilized image processing techniques and handcrafted feature engineering. These methodologies can be systematically classified into three principal categories: background suppression–based methods [7,8,9], human visual system–inspired approaches [10,11,12], and optimization–driven techniques [13,14,15,16,17]. Although these conventional methods made substantial historical contributions to the field, they exhibit notable limitations in handling complex background interference and adapting to diverse target morphologies and scales [21]. These constraints primarily stem from their inherent reliance on manual feature engineering and conventional image processing, which fundamentally lack the adaptive generalization capabilities required for robust performance in complex, real–world operational environments.
With the success of deep learning across various domains such as computer vision, natural language processing, and speech recognition, researchers have increasingly turned to deep learning techniques for IRSTD [33], and considerable progress has been achieved in recent years [6,18,19,20,21,22,23,24,25,26,27,28,29,30]. In contrast to traditional approaches, deep learning–based methods excel at autonomously extracting target features and contextual information from complex backgrounds, substantially enhancing model generalization. Among the notable contributions, Wang et al. [18] proposed the MDvsFA model to balance miss detection and false alarm rates, while Dai et al. [19] developed an Asymmetric Contextual Modulation (ACM) network to capture multi–level contextual features. Zhang et al. [21] introduced ISNet, incorporating a Taylor Finite Difference–inspired edge block and a Two–Orientation Attention Aggregation mechanism to retain target structure. Li et al. [20] designed a Dense Nested Attention Network (DNA–Net) to mitigate target information loss during pooling operations. Further architectural innovations include UIU–Net by Wu et al. [6], which employs nested U–Net connections for enhanced feature aggregation. More recently, Vision Transformers have also been leveraged for IRSTD—for instance, Wu et al. [22] adopted a ViT–based framework to integrate multi–scale features. Other advanced strategies include AGPCNet [23], which augments contextual perception in cluttered environments, and MSHNet by Liu et al. [24], which introduces a scale– and location–sensitive loss to improve detection performance. Additionally, Zhang et al. [26,27] incorporated frequency–domain insights from earlier studies into model design. Yuan et al. [28] proposed SCTransNet, utilizing a spatial–embedded channel–cross block for hierarchical multi–level feature fusion and decoding guidance. BGM [29] enhances Infrared Small Target Detection by forcing the network to focus exclusively on the targets through random masking/forgetting of irrelevant background information during training, thereby improving detection performance without increasing inference complexity. HFMNet [30] advances Infrared Small Target Detection by integrating cross–pooling, hybrid local–global feature mining, and focused fusion in a lightweight framework.
Although these deep learning–based methods have improved detection accuracy, they often introduce high computational and memory overhead. This is largely because they inherit complex network designs from general–purpose vision models without adapting to the inherently low semantic information in infrared scenes, thereby limiting their practical applicability.
To overcome these challenges, we propose WSNet—a novel lightweight approach that not only achieves competitive performance on the IRSTD task but also drastically reduces computational and memory demands. As a result, WSNet delivers faster inference and offers an optimal balance between efficiency, accuracy, and resource consumption, as illustrated in Figure 3.

2.2. Lightweight Network

Currently, many researchers focus on improving performance in the IRSTD task by enhancing Intersection over Union (IoU), Probability of Detection (Pd), and reducing False Alarm Rate (Fa). However, these improvements are often achieved by stacking multiple feature fusion modules or modifying the network structure based on heavyweight baselines. While this can lead to marginal performance gains, it results in a sharp increase in computational resource usage and memory consumption, which hinders the realistic deployment of models.
Fortunately, there is a growing group of researchers dedicated to exploring lightweight network paradigms. Notable examples include [34,35,36,37,38,39,40,41,42,43], all of which have achieved remarkable success in reducing the computational burden while maintaining performance. In the context of IRSTD, Zhao et al. [44] pioneered lightweight network design, while Wu et al. [45] introduced RepISD–Net, which significantly accelerates inference speed without sacrificing detection accuracy. Kou et al. [46] developed LW–IRSTNet, aiming for a balance between computational efficiency and detection performance.
Although these lightweight models have made significant progress in IRSTD, they still face challenges when high detection accuracy is required alongside low computational cost and memory usage. In contrast, our proposed WSNet features only 0.054 M parameters and 1.050 G FLOPs—substantially smaller than any existing IRSTD models. Despite its lightweight nature, WSNet can maintain the performance of some SOTA models across multiple datasets, offering a superior trade–off between efficiency and detection accuracy.

2.3. Width Extension Learning

The width and depth are two fundamental components in the design of a neural network architecture. Both play crucial roles in determining the network’s performance and should be carefully balanced. Depth governs the level of abstraction, enabling the network to learn hierarchical representations, while width influences the capacity to retain and process information during the forward pass [47].
In the pursuit of optimizing neural network architectures, Cheng et al. [48] introduced the wide and deep learning framework, which combines the benefits of both wide linear models and deep neural networks. Zagoruyko et al. [49] proposed Wide Residual Networks (WRNs), demonstrating that these architectures, with their wider structures, outperform their deeper and thinner counterparts. Szegedy et al. [50,51] developed the Inception architecture, achieving high–quality performance by striking a balance between network width and depth. Similarly, Hou et al. [52] utilized multiple convolution kernel sizes within a single layer to better capture contrasting features of small targets, while Sun et al. [53] introduced a multi–receptive–field feature extraction module (MRFE) to enhance the extraction of local features.
Building on these insights, we propose a custom Width Extension Module (WEM) designed to effectively capture a broader range of features from both the targets and their surrounding context.

2.4. Attention Mechanism

The attention mechanism is a powerful technique in machine learning and artificial intelligence, designed to enhance model performance by focusing on the most relevant information within the input data. It allows models to selectively prioritize different parts of the input by assigning varying levels of importance or weights to each element, thereby improving the accuracy and efficiency of processing.
In the domain of computer vision, numerous groundbreaking studies have leveraged attention mechanisms to advance model performance [32,54,55,56]. In the context of IRSTD, Li et al. [20] introduced a cascaded Channel and spatial attention Module (CSAM) to adaptively enhance multi–level feature representations. Wu et al. [6] proposed the Interactive–Cross Attention (IC–A) module, which encodes local context information between low–level details and high–level semantic features. Zhang et al. [21] designed the Two–Orientation Attention Aggregation (TOAA) block, which employs the attention mechanism to extract low–level information along both row and column directions. Additionally, Sun et al. [53] utilized a multi–directional guided attention mechanism to improve the representation of targets in low–level feature maps.
In this study, we refine the CBAM module [32] and propose a Channel–Spatial Hybrid Attention (CSHA) module. This is the first work to incorporate Lp Pooling into such a structure, which effectively smooths outliers while preserving the overall trend of features. The CSHA module combines both channel–wise and spatial attention to selectively emphasize relevant features, thereby improving the model’s ability to detect small infrared targets in challenging environments.

3. Methodology

3.1. Overall Architecture

The overall architecture of WSNet is illustrated in Figure 4. The network takes an infrared image $X \in \mathbb{R}^{1 \times H \times W}$ as input, which is transformed into initial feature maps $X' \in \mathbb{R}^{C \times H \times W}$ by a 3 × 3 convolutional layer. The architecture then diverges into two parallel branches. The main branch sequentially performs (1) down–sampling via a 3 × 3 convolutional layer with stride 2, generating low–resolution feature maps $X_l \in \mathbb{R}^{2C \times \frac{H}{2} \times \frac{W}{2}}$; (2) WEM to enhance feature representation across a broader spectrum; (3) CSHA to suppress irrelevant noise while emphasizing salient features; and (4) up–sampling through bilinear interpolation to reconstruct high–resolution feature maps $X_h \in \mathbb{R}^{2C \times H \times W}$. In parallel, the skip–connection branch applies a 1 × 1 convolutional layer to adjust the dimensionality of the feature maps, yielding $X_r \in \mathbb{R}^{2C \times H \times W}$. To optimize computational efficiency, we employ an element–wise addition operation instead of complex fusion modules, effectively reducing both model parameters and FLOPs. The network culminates in a Fully Convolutional Network Head (FCN Head) [57] followed by a Sigmoid activation function, producing the final prediction output $O \in \mathbb{R}^{1 \times H \times W}$.
The process described above can be expressed mathematically as
$$X' = \mathrm{Conv}_{3\times3}(X), \quad (1)$$
$$X_l = \mathrm{Conv}_{3\times3}^{\,stride=2}(X'), \quad (2)$$
$$X_w = \mathrm{WEM}(X_l), \quad (3)$$
$$X_s = \mathrm{CSHA}(X_w), \quad (4)$$
$$O = \sigma\{\mathrm{FCN}\{\mathrm{Sum}[\mathrm{Bilinear}(X_s),\ \mathrm{Conv}_{1\times1}(X')]\}\} \quad (5)$$
where $\mathrm{Sum}$ and $\mathrm{Bilinear}$ denote element–wise addition and bilinear interpolation, respectively, and $\sigma$ denotes the Sigmoid activation function.
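To make the data flow in Equations (1)–(5) concrete, the following PyTorch sketch wires the two branches together. It is a minimal illustration under stated assumptions, not the released implementation: the base channel width (c = 16) and the placeholder WEM, CSHA, and FCN-head modules are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSNetSkeleton(nn.Module):
    """Minimal sketch of the WSNet data flow in Equations (1)-(5).
    wem/csha/fcn_head are stand-in modules; c = 16 is an assumed base width."""
    def __init__(self, wem: nn.Module, csha: nn.Module, fcn_head: nn.Module, c: int = 16):
        super().__init__()
        self.stem = nn.Conv2d(1, c, kernel_size=3, padding=1)                 # X' = Conv3x3(X)
        self.down = nn.Conv2d(c, 2 * c, kernel_size=3, stride=2, padding=1)   # X_l, Eq. (2)
        self.wem = wem                                                         # X_w = WEM(X_l)
        self.csha = csha                                                       # X_s = CSHA(X_w)
        self.skip = nn.Conv2d(c, 2 * c, kernel_size=1)                         # Conv1x1(X')
        self.head = fcn_head                                                   # FCN head -> 1 channel

    def forward(self, x):
        x0 = self.stem(x)
        xs = self.csha(self.wem(self.down(x0)))
        xh = F.interpolate(xs, size=x0.shape[-2:], mode="bilinear", align_corners=False)
        fused = xh + self.skip(x0)              # element-wise Sum instead of concatenation
        return torch.sigmoid(self.head(fused))  # O in R^{1 x H x W}, Eq. (5)
```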

3.2. Width Extension Module

Wider networks, with more neurons (or blocks) per layer, provide greater expressive power, enabling them to capture diverse features simultaneously [47].
In light of this, we design a custom Width Extension Module (WEM) to enhance the model’s feature representation capability through systematic network widening. The detailed structure of WEM is illustrated in Figure 5. The module integrates multiple computational pathways: the yellow pathway uses Max Pooling to highlight the most salient regions of the feature maps, which typically contain target–related information; the blue pathway utilizes convolutional layers with progressively increasing kernel sizes (3 × 3, 5 × 5, and 7 × 7) to enrich feature diversity—enabling the network to capture visual features at multiple scales within the same level, where the 3 × 3 kernel extracts local patterns, while the 5 × 5 and 7 × 7 kernels provide a wider range of patterns and spatial correlations; and the red pathway utilizes dilated convolutions with different dilation rates to capture contextual information—an aspect often under–represented in shallow networks. To reduce computational complexity, we incorporate 1 × 1 convolutional kernels in each branch to reduce the number of channels while preserving critical feature representations. We then use summation (Sum) rather than concatenation (Concat) to integrate the multi–scale features. This choice is motivated by two factors: first, the multi–scale features exhibit semantic similarity and have identical dimensions, making them suitable for direct fusion; second, summation is computationally more efficient than concatenation, as it does not increase the channel count or computational overhead—a conclusion supported by the ablation study in Section 4.5, Impact of the Design Choice of WEM. It is also worth noting that the multi–branch architecture tends to introduce extraneous noise (particularly due to the inclusion of Max Pooling), which can mislead subsequent layers and lead to higher false alarm (FA) rates. To mitigate this risk, we apply a 3 × 3 convolutional layer for feature refinement.
The following is the mathematical formula expression for this module:
$$W_{B_1} = \mathrm{Conv}_{1\times1}[\mathrm{MaxPooling}(X_l)], \quad (6)$$
$$W_{B_{n_1}} = \mathrm{Conv}_{k\times k}[\mathrm{Conv}_{1\times1}(X_l)], \quad n_1 \in \{2,3,4\},\ k \in \{3,5,7\} \quad (7)$$
$$W_{B_{n_2}} = \mathrm{DConv}_{5\times5}[\mathrm{Conv}_{1\times1}(X_l)], \quad n_2 \in \{5,6\} \quad (8)$$
where $\mathrm{Conv}_{k\times k}$ ($k \in \{3,5,7\}$) denotes a $k \times k$ convolutional layer, and $\mathrm{DConv}_{5\times5}$ represents a 5 × 5 dilated convolutional layer with a dilation rate of either 3 or 4. The output of each branch in WEM is denoted as $W_{B_n}$, where $n \in \{1,2,\ldots,6\}$ indexes the six branches of WEM. Note that in Equations (7) and (8), $n_1 \in \{2,3,4\}$ corresponds to the $k \times k$ convolutional branches, while $n_2 \in \{5,6\}$ corresponds to the dilated convolutional branches.
Eventually, the final output of WEM can be attained by sequentially performing the S u m operation and the 3 × 3 Convolutional layer:
$$X_w = \mathrm{Conv}_{3\times3}[\mathrm{Sum}(W_{B_1}, W_{B_{n_1}}, W_{B_{n_2}})] \quad (9)$$
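A possible PyTorch rendering of Equations (6)–(9) is sketched below. Only the branch layout (a max–pooling branch, 3 × 3/5 × 5/7 × 7 convolutional branches, and 5 × 5 dilated branches with dilation 3 and 4, fused by summation and refined by a 3 × 3 convolution) follows the description above; the per–branch channel reduction ratio and the 3 × 3 max–pooling window are assumptions.

```python
import torch
import torch.nn as nn

class WEMSketch(nn.Module):
    """Hedged sketch of Equations (6)-(9); branch widths and the pooling window
    are assumptions, not the authors' exact settings."""
    def __init__(self, channels: int, reduced: int = None):
        super().__init__()
        reduced = reduced or max(channels // 2, 1)
        # Branch 1 (Eq. 6): max pooling highlights salient regions, then a 1x1 conv
        self.b1 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(channels, reduced, 1))
        # Branches 2-4 (Eq. 7): 1x1 channel reduction, then k x k convs, k = 3, 5, 7
        self.bk = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, reduced, 1),
                          nn.Conv2d(reduced, reduced, k, padding=k // 2))
            for k in (3, 5, 7)])
        # Branches 5-6 (Eq. 8): 5x5 dilated convs with dilation rates 3 and 4
        self.bd = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, reduced, 1),
                          nn.Conv2d(reduced, reduced, 5, padding=2 * d, dilation=d))
            for d in (3, 4)])
        # Refinement conv (Eq. 9) suppresses branch-induced noise and restores width
        self.refine = nn.Conv2d(reduced, channels, 3, padding=1)

    def forward(self, x):
        outs = [self.b1(x)] + [b(x) for b in self.bk] + [b(x) for b in self.bd]
        return self.refine(torch.stack(outs, dim=0).sum(dim=0))  # Sum, not Concat
```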

3.3. Channel–Spatial Hybrid Attention

The attention mechanism can significantly enhance detection performance by guiding the model to focus on regions with distinctive features, thereby reducing the chances of missing small targets. Additionally, by assigning lower weights to irrelevant regions, such as cluttered backgrounds or noisy areas, the attention mechanism effectively filters out unnecessary information, making it highly suitable for addressing the challenges of the IRSTD task.
Inspired by this, we build our Channel–Spatial Hybrid Attention (CSHA) module, which consists of two sequential sub–modules: Channel Attention (CA) and Spatial Attention (SA).
The motivation behind CA lies in our belief that the Max and Average Pooling used in conventional channel attention (e.g., SENet [55] and CBAM [32]) are suboptimal for IRSTD tasks, whereas Lp Pooling offers a better alternative. Specifically: (1) Max Pooling is sensitive to outliers (e.g., extremely high activation values), potentially amplifying noise; (2) Average Pooling tends to dilute salient features and lacks robustness to noise; (3) Lp Pooling (e.g., when p = 2) balances these issues by smoothing outliers while preserving overall feature trends—making it more suitable for the inherently noisy nature of infrared images, especially considering that the purpose of channel attention is to better promote feature dimensionality reduction and inter–channel information aggregation. Our ablation studies (described in Section 4.5) confirm this: CSHA with Lp Pooling significantly outperforms SENet and CBAM in IRSTD performance.
The mathematical definition of $L_p$ Pooling can be formulated as
$$y = \left(\frac{1}{n}\sum_{i=1}^{n}|X_i|^{p}\right)^{\frac{1}{p}} \quad (10)$$
where $X_i$ ($i \in \{1,2,\ldots,n\}$), $n$, and $y$ denote the input feature values, the total number of pooled values, and the output of $L_p$ Pooling, respectively. Furthermore, $p \in \{1, 2, \ldots, \infty\}$ is a parameter that controls the pooling behavior; in this study, we set $p = 2$. Similar to conventional channel attention mechanisms, we compute Equation (10) over the spatial dimensions.
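As a small illustration, Equation (10) computed over the spatial dimensions can be written in a few lines of PyTorch; the epsilon clamp is an added numerical safeguard rather than part of the equation.

```python
import torch

def lp_pooling(x: torch.Tensor, p: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Global Lp pooling over the spatial dimensions, following Eq. (10).
    x: (B, C, H, W) -> (B, C, 1, 1). p = 1 recovers average pooling, large p
    approaches max pooling, and p = 2 is the setting used in this paper."""
    return x.abs().clamp(min=eps).pow(p).mean(dim=(-2, -1), keepdim=True).pow(1.0 / p)
```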
For SA, our goal is to guide the model to focus on regions with distinctive features—such as the salient parts of the target—while minimizing the introduction of noise. To achieve this, we employ both Max Pooling and Average Pooling: Max Pooling highlights the most prominent features, while Average Pooling provides a smooth representation by considering all values within the region.
The input feature maps (here $X_w \in \mathbb{R}^{2C \times \frac{H}{2} \times \frac{W}{2}}$, the output of WEM) are fed into CSHA, which sequentially performs CA and SA to obtain the channel–guided and spatial–guided weights. In detail, Figure 6 and Figure 7 depict the computation process.
Mathematically, the two stages of CSHA can be formulated as
$$X_c = \sigma\{\mathrm{MLP}[L_p\mathrm{Pooling}(X_w)]\} \otimes X_w, \quad (11)$$
$$X_s = \sigma\{\mathrm{Conv}_{7\times7}\{\mathrm{Concat}[\mathrm{MaxPooling}(X_c),\ \mathrm{AvgPooling}(X_c)]\}\} \otimes X_c \quad (12)$$
where $\mathrm{MLP}$ denotes a multi–layer perceptron with one hidden layer, and $\mathrm{Concat}$ denotes the concatenation operation. $\sigma$ and $\otimes$ denote the Sigmoid function and element–wise multiplication, respectively.
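Combining Equations (11) and (12), one plausible PyTorch sketch of CSHA is given below. The hidden–layer reduction ratio of the MLP (r = 4) is not specified in this section and is therefore an assumption.

```python
import torch
import torch.nn as nn

class CSHASketch(nn.Module):
    """Sketch of Equations (11)-(12): Lp-pooling-driven channel attention followed
    by CBAM-style spatial attention. The MLP reduction ratio r = 4 is assumed."""
    def __init__(self, channels: int, p: float = 2.0, r: int = 4):
        super().__init__()
        self.p = p
        self.mlp = nn.Sequential(nn.Linear(channels, channels // r),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // r, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention (Eq. 11): Lp pooling over space -> MLP -> sigmoid gate
        desc = x.abs().pow(self.p).mean(dim=(-2, -1)).pow(1.0 / self.p)   # (B, C)
        xc = torch.sigmoid(self.mlp(desc)).view(b, c, 1, 1) * x
        # Spatial attention (Eq. 12): max/avg maps -> 7x7 conv -> sigmoid gate
        smap = torch.cat([xc.max(dim=1, keepdim=True).values,
                          xc.mean(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.spatial(smap)) * xc
```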

3.4. Fully Convolutional Network Head

After finishing the processes described in Section 3.2 and Section 3.3, we incorporate a skip connection to integrate low–level and high–level features, and subsequently employ a Fully Convolutional Network (FCN) Head to generate the final prediction $O \in \mathbb{R}^{1 \times H \times W}$.
The process can be formulated as
$$O = \sigma\{\mathrm{FCN}\{\mathrm{Sum}[\mathrm{Bilinear}(X_s),\ \mathrm{Conv}_{1\times1}(X')]\}\} = \sigma\{\mathrm{Conv}_{1\times1}\{\mathrm{Conv}_{3\times3}\{\mathrm{Sum}[\mathrm{Bilinear}(X_s),\ \mathrm{Conv}_{1\times1}(X')]\}\}\} \quad (13)$$
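Equation (13) corresponds to a very small prediction head. The sketch below is one plausible reading; any intermediate normalization or activation used in the actual implementation is not shown, and the Sigmoid is applied outside the head, as in Equation (13).

```python
import torch.nn as nn

def fcn_head(in_channels: int) -> nn.Sequential:
    """One plausible reading of the FCN head in Eq. (13): a 3x3 conv followed by a
    1x1 conv projecting to a single-channel prediction map."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
        nn.Conv2d(in_channels, 1, kernel_size=1),
    )
```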

4. Experiment

4.1. Implementation Details

Evaluation Metrics. To provide a comprehensive evaluation, we use six performance metrics to assess the effectiveness of different methods: Intersection over Union (IoU), Probability of Detection (Pd), False Alarm Rate (Fa), Frames Per Second (FPS), Model Parameters (Params), and Floating–Point Operations (FLOPs). Note that all these metrics refer to the average value. These metrics allow us to evaluate both the detection accuracy and the efficiency of the models, offering a well–rounded comparison across various approaches.
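For readers reimplementing the evaluation, the sketch below computes IoU, Pd, and Fa for a single binary prediction/label pair. The target–level matching rule (centroid distance within 3 pixels) is a common convention in the IRSTD literature but is an assumption here; the exact protocol used in this paper may differ.

```python
import numpy as np
from scipy import ndimage

def irstd_metrics(pred: np.ndarray, gt: np.ndarray, dist_thresh: float = 3.0):
    """Illustrative IoU / Pd / Fa computation for one binary prediction and label.
    Matching by centroid distance is an assumed convention, not the paper's protocol."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    iou = (pred & gt).sum() / max((pred | gt).sum(), 1)          # pixel-level IoU

    gt_lbl, n_gt = ndimage.label(gt)
    pred_lbl, n_pred = ndimage.label(pred)
    gt_cent = ndimage.center_of_mass(gt.astype(float), gt_lbl, range(1, n_gt + 1))
    pred_cent = ndimage.center_of_mass(pred.astype(float), pred_lbl, range(1, n_pred + 1))

    hits, matched_px = 0, 0
    for gc in gt_cent:
        d = [np.hypot(gc[0] - pc[0], gc[1] - pc[1]) for pc in pred_cent]
        if d and min(d) <= dist_thresh:
            hits += 1
            matched_px += int((pred_lbl == int(np.argmin(d)) + 1).sum())

    pd = hits / max(n_gt, 1)                        # target-level detection probability
    fa = (pred.sum() - matched_px) / pred.size      # pixel-level false alarm rate
    return iou, pd, fa
```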
Dataset. To demonstrate the superior performance of the proposed method and evaluate its generalization capability, we conduct experiments using three well–known datasets: SIRST [19], NUDT–SIRST [20], and IRSTD–1K [21]. In line with previous works on these datasets [20,21], we adopt the same experimental settings. Specifically, we set the train–to–test ratio to 1:1 for both the SIRST and NUDT–SIRST datasets, and to 4:1 for the IRSTD–1K dataset. This setup ensures consistency with prior research while allowing us to effectively assess the proposed method’s performance across different data distributions.
Experimental Settings. During training, all images were normalized and randomly cropped into patches of size 256 × 256, except for the IRSTD–1K dataset, where the patch size was set to 512 × 512. To augment the training data, we applied random flipping and rotation. The Soft–IoU loss function as well as the Adam optimizer with a batch size of 8 and an initial learning rate of 5 × 10−4 were used. The learning rate was reduced by a factor of 10 at the 200th and 300th epochs. Training was terminated after 500 epochs. A fixed segmentation threshold of 0.5 was applied during inference to filter out low–response regions and retain high–response areas for the final prediction. All models were implemented using the PyTorch 1.2.0 or higher framework and were both trained and tested on an NVIDIA GeForce RTX 4090 GPU to ensure a consistent experimental environment. For fair comparison, in the comparison experiments, we used the same hyper–parameters and experimental settings as reported in the original papers of the competing methods. All models were retrained from scratch on the three datasets to ensure consistency.
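The training configuration above can be summarized in the following sketch. The Soft–IoU formulation shown is one common variant applied to the sigmoid–activated output O from Section 3; the authors’ exact loss implementation may differ, and the model constructor is a placeholder.

```python
import torch

def soft_iou_loss(prob: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """A common Soft-IoU variant on sigmoid-activated predictions of shape (B, 1, H, W)."""
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + smooth) / (union + smooth)).mean()

# Optimizer and schedule as stated in the text: Adam, initial lr 5e-4, batch size 8,
# lr divided by 10 at epochs 200 and 300, training stopped after 500 epochs.
# model = WSNet(...)  # hypothetical constructor
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 300], gamma=0.1)
```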

4.2. Comparison to State–of–the–Art Methods

In this subsection, we compare our proposed WSNet model to nine traditional methods—Top–Hat [7], Max–Median [8], WSLCM [12], TLLCM [11], IPI [13], NRAM [14], RIPT [15], PSTNN [16], and MSLSTIPT [17]—as well as nine state–of–the–art (SOTA) deep learning–based methods, including ISNet [21], DNA–Net [20], UIU–Net [6], AGPCNet [23], MTU–Net [22], MSHNet [24], SCTransNet [28], BGM [29], and HFMNet [30].
Quantitative results are presented in Table 1. Our WSNet model achieves competitive performance across multiple evaluation metrics while exhibiting superior efficiency compared to all other state–of–the–art deep learning–based methods. Specifically, WSNet has the smallest number of parameters (only 0.054 M—more than ten times fewer than the previous best model) and the lowest computational cost (only 1.050 G FLOPs—an order of magnitude lower than the best–performing alternative). It also delivers the highest inference speed, reaching 119 FPS, 146 FPS, and 60 FPS on the SIRST [19], NUDT–SIRST [20], and IRSTD–1K [21] datasets, respectively, which is several times faster than previous SOTA models. A key challenge was maintaining high detection performance while achieving such extreme efficiency. Remarkably, WSNet excels in this regard, attaining outstanding results across various detection metrics. For instance, it achieves the highest IoU on SIRST and the best Pd on NUDT–SIRST. Although minor decreases are observed in certain metrics, WSNet’s exceptional efficiency more than compensates for these slight shortcomings. Overall, WSNet represents the best trade–off currently available in terms of computational cost, inference speed, and accuracy for infrared small target detection.
Qualitative results are shown in Figure 8. We present several representative detection examples to highlight the superiority of our proposed method. In the first row of images, traditional methods are clearly hindered by noise interference, leading to a substantial number of false alarms. In contrast, the second row demonstrates that WSNet achieves significantly better fine–grained segmentation. This improvement can be attributed to the enhanced feature representation capability provided by our WEM. Traditional methods and even some previous deep learning–based approaches struggle to effectively extract features from small–scale and low–contrast targets, often resulting in false alarms and missed detections. However, WSNet effectively addresses these challenges by sequentially incorporating WEM and CSHA, achieving superior performance in terms of both fine–grained segmentation and overall detection accuracy. Even though some false alarms and missed detections may still occur in semantically rich scenarios, this effect can be alleviated by stacking additional convolutional layers in the initial stage of the network.

4.3. Comparison to Lightweight Methods

To further emphasize the superiority of our approach, we compare our WSNet model with several of the most lightweight methods in the IRSTD field, including ACM [19], ALCNet [58], ISNet [21], RDIAN [53], LW–IRSTNet [46], RepISD [45], and HFMNet–Tiny [30].
Quantitative results are presented in Table 2. Compared to other lightweight models in the IRSTD field, WSNet not only maintains superior efficiency—remaining the most lightweight approach—but also achieves the fastest inference speed. Specifically, WSNet runs approximately 1.5× faster in FPS than the previous fastest model. In terms of detection performance, WSNet also outperforms other lightweight models across most evaluation metrics, further solidifying its status as the optimal trade–off among computational cost, inference speed, and accuracy.

4.4. Deployment in Resource–Constrained Device

To evaluate the practical applicability of the proposed model in real–world scenarios, we conducted comparative experiments on a resource–constrained platform (Intel Xeon Platinum 8352V CPU). For simplicity, a set of representative models—including both state–of–the–art and lightweight approaches—were deployed on the same CPU platform. As shown in Table 3, WSNet achieves the highest inference speed among all models, reaching 30 FPS, which is approximately twice as fast as the compared alternatives. This performance meets the basic requirements for real–time detection even under limited computational resources.

4.5. Ablation Study

In this subsection, we compare our WSNet with several variants to investigate the potential advantages brought by our network modules and design choices.
Why Width, Not Depth? This design choice may seem counterintuitive from the perspective of conventional deep learning theory, where deep and narrow architectures are typically considered more effective for most computer vision tasks. However, this assumption often does not hold in the context of Infrared Small Target Detection (IRSTD). To validate our perspective, we conducted an experiment by replacing the Width Extension Module (WEM) with standard ResBlocks [59], progressively stacking them to construct deeper networks. This allows us to examine how increasing model depth affects performance. As shown in Table 4, increasing network depth does not lead to consistent gains across benchmark datasets. Beyond a certain depth, performance declines—a trend attributable to the limited semantic complexity in infrared images, which makes excessively deep architectures unnecessary and often counterproductive. Furthermore, infrared targets are typically small and dim, making them prone to overfitting and to the loss of fine–grained target features in deeper layers. Although techniques such as multi–layer feature fusion can mitigate this issue, they often introduce significant computational and parameter overhead. Given these observations, we shift focus from depth to width, offering a more efficient and task–appropriate strategy for IRSTD.
Impact of WEM and CSHA. As shown in Table 5, we conduct ablation studies to validate the effectiveness of WEM and CSHA on different datasets. For WSNet without both WEM and CSHA, a dramatic decrease in all metrics is observed compared to the standard WSNet. When WSNet is modified to exclude either WEM or CSHA, all metrics show varying levels of improvement; however, there remains a significant performance gap between these variants and the standard WSNet model.
Impact of the Number of WEM Blocks. As shown in Table 6, we conduct ablation experiments to evaluate the impact of using different numbers of WEM blocks. For the first six blocks, we adopt the same configuration as illustrated in Figure 5. For additional blocks beyond that, we continue to expand the model’s receptive field to capture richer contextual information. The results indicate that increasing the number of blocks generally improves performance within the first six WEM blocks, which can be attributed to the enhanced feature representation capacity resulting from a systematic expansion of the network’s width. However, as more blocks are added, a slight performance degradation is observed. This may be due to the fact that while the feature representation ability improves initially with more blocks, beyond a certain point, the gains diminish and the model becomes more susceptible to overfitting. Visualized heatmaps in Figure 9 support the idea that the structured integration of pooling, convolution, and dilated convolution within the WEM significantly enhances its ability to highlight salient regions, increases feature diversity, and captures contextual information more effectively. However, once the number of blocks exceeds a certain threshold, further increases may result in the over–amplification of contextual information.
Impact of the Design Choice of CSHA. As shown in Table 7, we conduct ablation experiments to evaluate the effectiveness of different design choices for the CSHA module, including the selection of the parameter p in Lp pooling. The baseline model, which does not include the CSHA module, exhibits inferior performance. Adding either channel attention (CA) or spatial attention (SA) individually improves performance, with each component playing a distinct role: CA suppresses irrelevant regions by modulating channel–wise contributions, while SA directs spatial attention toward discriminative local features, thereby reducing the likelihood of missed detections. Furthermore, compared to the widely used CBAM—which relies on Max and Average Pooling in its channel attention component—our CSHA module adopts Lp pooling for more effective feature aggregation. We also explore the influence of the pooling parameter p, noting that p = 1 corresponds to Average Pooling and p → ∞ approximates Max Pooling, both of which yield suboptimal results for IRSTD. In contrast, p = 2 balances the preservation of global feature distribution with robustness to outliers, leading to better overall performance. Visual results in Figure 10 further illustrate that CSHA enhances the saliency of meaningful target regions while suppressing noise that could otherwise trigger false alarms.
The Impact of Low–level and High–level Feature Fusion Choice and the Multi–scale Feature Fusion Choice in WEM. As shown in Table 8, we conduct ablation studies to evaluate the effectiveness of different design choices for low–level and high–level feature fusion, as well as for multi–scale feature fusion in the WEM. The results indicate that using a simple skip connection with summation to integrate low–level and high–level features yields better performance across most datasets, despite minor degradations in Fa compared to AFF–ResBlock [60]. For multi–scale feature fusion in the WEM, the summation operation consistently outperforms concatenation across all metrics. These empirical findings suggest that even simple fusion strategies can be highly effective in our model.

4.6. The Experimental Findings

Our experimental results confirm that in Infrared Small Target Detection, expanding network width is more effective than increasing depth, as excessive depth leads to performance degradation. The proposed Width Extension Module (WEM) and Channel–Spatial Hybrid Attention (CSHA) are both essential, with their removal causing significant performance drops. The optimal configuration uses six WEM blocks, beyond which returns diminish. CSHA, with its Lp pooling design, outperforms conventional attention mechanisms by better suppressing noise and emphasizing meaningful features. Furthermore, simple summation–based feature fusion proves more effective than complex fusion strategies. These findings collectively demonstrate that a width–prioritized, lightweight architecture with streamlined modules is well suited to the low–semantic nature of infrared imagery.

5. Conclusions

In this paper, we propose WSNet, a wide and shallow network tailored for IRSTD. By leveraging a wide and shallow architecture and introducing the Width Extension Module and Channel–Spatial Hybrid Attention, WSNet effectively enhances feature representation while maintaining an extremely lightweight structure. With only 0.054 M parameters and 1.050 G FLOPs, WSNet matches SOTA methods on multiple benchmarks, yet requires significantly less computation and memory. Its efficient design enables real–time inference on CPUs, making it highly suitable for deployment in embedded and resource–constrained environments. WSNet provides a practical and effective solution for real–world IRSTD applications and offers a new inspiration for lightweight model design in this field.
Limitation. Although WSNet achieves an excellent balance between detection accuracy and inference speed, it exhibits a relatively high False Alarm Rate (Fa) in semantically complex scenes. This limitation arises from its shallow architecture, which is less effective at extracting high–level semantic features, leading to increased false alarms. While stacking additional convolutional layers at the initial stage could enhance semantic representation and help lower Fa, it would also raise computational costs. Since WSNet is designed for real–time deployment on resource–limited devices such as CPUs and embedded platforms, a deliberate trade–off among detection performance, false alarms, and inference efficiency was made to ensure practical usability.

Author Contributions

Conceptualization, P.L. and Y.L. (Yihan Luo); methodology, P.L. and Y.L. (Yihan Luo); software, P.L., X.Z. and H.J.; validation, X.Z. and H.J.; formal analysis, S.X. and H.J.; investigation, S.X. and Y.L. (Yaqing Liu); resources, Y.L. (Yihan Luo); data curation, Y.L. (Yaqing Liu); writing––original draft preparation, P.L.; writing––review and editing, Y.L. (Yihan Luo) and H.J.; visualization, X.Z.; supervision, Y.L. (Yihan Luo) and H.J.; project administration, Y.L. (Yihan Luo); funding acquisition, Y.L. (Yihan Luo). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by National Natural Science Foundation of China (62271468).

Data Availability Statement

The datasets utilized in this study are publicly accessible. All data supporting the findings of this work, including those used for training and testing, can be found in the repository associated with this project (accessed on 15 January 2025) at https://github.com/CPaul33/WSNet.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small–target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar] [CrossRef]
  2. Kou, R.; Wang, C.; Peng, Z.; Zhao, Z.; Chen, Y.; Han, J.; Huang, F.; Yu, Y.; Fu, Q. Infrared small target segmentation networks: A survey. Pattern Recognit. 2023, 143, 109788. [Google Scholar] [CrossRef]
  3. Deng, H.; Sun, X.; Liu, M.; Ye, C.; Zhou, X. Small infrared target detection based on weighted local difference measure. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4204–4214. [Google Scholar] [CrossRef]
  4. Zhang, J.; Tao, D. Empowering things with intelligence: A survey of the progress, challenges, and opportunities in artificial intelligence of things. IEEE Internet Things J. 2020, 8, 7789–7817. [Google Scholar] [CrossRef]
  5. Teutsch, M.; Krüger, W. Classification of small boats in infrared images for maritime surveillance. In Proceedings of the 2010 International Waterside Security Conference, Carrara, Italy, 3–5 November 2010; pp. 1–7. [Google Scholar]
  6. Wu, X.; Hong, D.; Chanussot, J. UIU–Net: U–Net in U–Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376. [Google Scholar] [CrossRef]
  7. Rivest, J.F.; Fortin, R. Detection of dim targets in digital infrared imagery by morphological image processing. Opt. Eng. 1996, 35, 1886–1893. [Google Scholar] [CrossRef]
  8. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max–mean and max–median filters for detection of small targets. In Signal and Data Processing of Small Targets 1999; SPIE: Bellingham, WA, USA, 1999; Volume 3809, pp. 74–83. [Google Scholar]
  9. Qin, Y.; Bruzzone, L.; Gao, C.; Li, B. Infrared small target detection based on facet kernel and random walker. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7104–7118. [Google Scholar] [CrossRef]
  10. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
  11. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A local contrast method for infrared small–target detection utilizing a tri–layer window. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1822–1826. [Google Scholar] [CrossRef]
  12. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1670–1674. [Google Scholar] [CrossRef]
  13. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch–image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef] [PubMed]
  14. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared small target detection via non–convex rank approximation minimization joint l2,1 norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  15. Dai, Y.; Wu, Y. Reweighted infrared patch–tensor model with both nonlocal and local priors for single–frame small target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767. [Google Scholar] [CrossRef]
  16. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382. [Google Scholar] [CrossRef]
  17. Sun, Y.; Yang, J.; An, W. Infrared dim and small target detection via multiple subspace learning and spatial–temporal patch–tensor model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3737–3752. [Google Scholar] [CrossRef]
  18. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8509–8518. [Google Scholar]
  19. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 950–959. [Google Scholar]
  20. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef]
  21. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
  22. Wu, T.; Li, B.; Luo, Y.; Wang, Y.; Xiao, C.; Liu, T.; Yang, J.; An, W.; Guo, Y. MTU–Net: Multilevel TransUNet for space–based infrared tiny ship detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  23. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention–guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261. [Google Scholar] [CrossRef]
  24. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared small target detection with scale and location sensitivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 17490–17499. [Google Scholar]
  25. Zhang, M.; Wang, Y.; Guo, J.; Li, Y.; Gao, X.; Zhang, J. IRSAM: Advancing segment anything model for infrared small target detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 233–249. [Google Scholar]
  26. Zhang, R.; Xu, L.; Yu, Z.; Shi, Y.; Mu, C.; Xu, M. Deep–IRTarget: An automatic target detector in infrared imagery using dual–domain feature extraction and allocation. IEEE Trans. Multimed. 2021, 24, 1735–1749. [Google Scholar] [CrossRef]
  27. Zhang, R.; Yang, B.; Xu, L.; Huang, Y.; Xu, X.; Zhang, Q.; Jiang, Z.; Liu, Y. A benchmark and frequency compression method for infrared few–shot object detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5001711. [Google Scholar] [CrossRef]
  28. Yuan, S.; Qin, H.; Yan, X.; Akhtar, N.; Mian, A. Sctransnet: Spatial–channel cross transformer network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  29. Liu, Y.; Ma, Z.; Zhu, W.; Li, N.; Li, C.; Xiong, K.; Wang, Z.; Feng, W.; Jiang, J.; Quan, Y. Forgetting the background: A masking approach for enhanced infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5005615. [Google Scholar] [CrossRef]
  30. Cheng, K.; Ma, T.; Fei, R.; Li, J. A Lightweight Feature Enhancement Model for Infrared Small Target Detection. IEEE Sens. J. 2025, 25, 15224–15234. [Google Scholar] [CrossRef]
  31. Li, B.; Wang, Y.; Wang, L.; Zhang, F.; Liu, T.; Lin, Z.; An, W.; Guo, Y. Monte Carlo linear clustering with single–point supervision is enough for infrared small target detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1009–1019. [Google Scholar]
  32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  33. Liu, M.; Du, H.Y.; Zhao, Y.J.; Dong, L.Q.; Hui, M.; Wang, S.X. Image small target detection based on deep learning with SNR controlled sample generation. In Current Trends in Computer Science and Mechanical Automation; De Gruyter: Berlin, Germany, 2017; Volume 1, pp. 211–220. [Google Scholar]
  34. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  35. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  36. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet–level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  37. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real–time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar]
  38. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  40. Tan, M.; Le, Q.V. Mixconv: Mixed depthwise convolutional kernels. arXiv 2019, arXiv:1907.09595. [Google Scholar] [CrossRef]
  41. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  42. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  43. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  44. Zhao, M.; Cheng, L.; Yang, X.; Feng, P.; Liu, L.; Wu, N. TBC–Net: A real–time detector for infrared small target detection using semantic constraint. arXiv 2019, arXiv:2001.05852. [Google Scholar]
  45. Wu, S.; Xiao, C.; Wang, L.; Wang, Y.; Yang, J.; An, W. Repisd-net: Learning efficient infrared small-target detection network via structural re-parameterization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5622712. [Google Scholar] [CrossRef]
  46. Kou, R.; Wang, C.; Yu, Y.; Peng, Z.; Yang, M.; Huang, F.; Fu, Q. LW-IRSTNet: Lightweight infrared small target segmentation network and application deployment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5621313. [Google Scholar] [CrossRef]
  47. Lu, Z.; Pu, H.; Wang, F.; Hu, Z.; Wang, L. The expressive power of neural networks: A view from the width. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  48. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10. [Google Scholar]
  49. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146. [Google Scholar] [CrossRef]
  50. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  51. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  52. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust infrared small target detection network. IEEE Geosci. Remote Sens. Lett. 2021, 19, 7000805. [Google Scholar] [CrossRef]
  53. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large–scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513. [Google Scholar] [CrossRef]
  54. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  55. Hu, J.; Shen, L.; Sun, G. Squeeze–and–excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  56. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  57. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  58. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  59. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  60. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar]
Figure 1. The trend of Params and FLOPs in mainstream deep learning–based methods for IRSTD. Although models such as MSHNet and later approaches have achieved significant reductions in both parameters and computational cost, they still typically contain several million parameters and require several GFLOPs of computation. This level of complexity makes them unrealistic for deployment on embedded and resource–constrained hardware.
Figure 2. Comparison between infrared images and conventional images. Compared to conventional images, targets and their surrounding environments in infrared images generally lack rich semantic information due to the absence of hue and detailed texture [31]. (a–c): Infrared images; (d–f): conventional images. The object inside the red box is our detection target.
Figure 3. Trade–off among IoU, Params and FPS, tested on SIRST dataset. Our WSNet achieves the smallest model size and the fastest inference speed, while simultaneously maintaining a competitive IoU performance.
Figure 4. The overall architecture of the proposed WSNet. It leverages a highly streamlined design, comprising only a few convolutional layers and an attention–guided mechanism, to deliver compelling performance.
Figure 5. The detailed structure of Width Extension Module (WEM). The WEM consists of multiple branches, each with a different receptive field (RF).
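The caption of Figure 5 states only that the WEM aggregates parallel branches with different receptive fields; the number of branches, channel widths, and kernel/dilation settings are not given here. The sketch below is therefore an illustrative multi–branch block in PyTorch, with summation–based fusion of the branch outputs as supported by the ablation in Table 8; all layer hyperparameters are assumptions rather than the published WSNet configuration.

```python
# Illustrative width-extension block: parallel branches with different
# receptive fields, fused by element-wise summation. Channel widths, kernel
# sizes, and dilation rates are placeholders, not the released WSNet config.
import torch
import torch.nn as nn

class WidthExtensionBlock(nn.Module):
    def __init__(self, channels=16, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations  # each dilation yields a different receptive field
        ])

    def forward(self, x):
        # Sum the multi-scale branch outputs (cf. Table 8, "summation" fusion).
        out = self.branches[0](x)
        for branch in self.branches[1:]:
            out = out + branch(x)
        return out

# Example: y = WidthExtensionBlock(16)(torch.randn(1, 16, 256, 256))
```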
Figure 6. The channel attention (CA) of our CSHA. (a) Channel attention of the typical CBAM [32]. (b) Redesigned channel attention of our CSHA.
Figure 7. The spatial attention (SA) of our CSHA.
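Figures 6 and 7 indicate that CSHA pairs a channel attention redesigned from CBAM with a spatial attention branch, but the redesign details (including the role of the parameter p studied in Table 7) are not specified in the captions. The following is a generic CBAM–style channel–plus–spatial attention sketch for orientation only; it should not be read as the exact CSHA formulation.

```python
# Generic channel + spatial attention in the spirit of CBAM [32]; the actual
# CSHA redesign (and its parameter p) is described in the paper body and differs.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels=16, reduction=4, spatial_kernel=7):
        super().__init__()
        # Channel attention: squeeze spatial dims, then re-weight channels.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: squeeze channels, then re-weight spatial positions.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # channel weights
        x = x * ca
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa                                       # spatial weights
```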
Figure 8. Qualitative comparison with state–of–the–art methods on several infrared images. Correctly detected targets are marked in red, missed targets in blue, and false alarms in yellow. For better visualization, a close–up view of the target is shown in a larger box.
Figure 9. Heatmaps of WEM outputs with varying numbers of blocks. The target is marked in red, and useful contextual information is marked in yellow.
Figure 10. Heatmaps of CSHA outputs with different design choices. The target is marked in red, and noise is marked in yellow.
Table 1. Quantitative results of different SOTA methods. Params (M), FLOPs (G), FPS, IoU (%), Pd (%), and Fa (×10⁻⁵) are reported on the three datasets. ‘↓’ denotes lower is better and ‘↑’ denotes higher is better. The best results are in bold.

Method | Params (M) | FLOPs (G) | SIRST (FPS / IoU / Pd / Fa) | NUDT–SIRST (FPS / IoU / Pd / Fa) | IRSTD–1K (FPS / IoU / Pd / Fa)
Top-Hat [7] | – | – | – / 7.142 / 79.841 / 10,120.03 | – / 20.724 / 78.408 / 1667.04 | – / 10.062 / 75.108 / 14,320.03
Max-Median [8] | – | – | – / 1.168 / 30.196 / 553.32 | – / 4.201 / 58.413 / 368.88 | – / 7.003 / 65.213 / 597.31
WSLCM [12] | – | – | – / 1.021 / 80.987 / 458,461.64 | – / 0.848 / 74.574 / 523,916.33 | – / 0.989 / 70.026 / 150,270.84
TLLCM [11] | – | – | – / 11.034 / 79.473 / 72.68 | – / 7.059 / 62.014 / 461.18 | – / 5.357 / 63.966 / 49.28
IPI [13] | – | – | – / 25.674 / 85.551 / 114.7 | – / 17.758 / 74.486 / 412.3 | – / 27.923 / 81.374 / 161.83
NRAM [14] | – | – | – / 12.164 / 74.523 / 138.52 | – / 6.931 / 56.403 / 192.67 | – / 15.249 / 70.677 / 169.26
RIPT [15] | – | – | – / 11.048 / 79.077 / 226.12 | – / 29.441 / 91.853 / 443.03 | – / 14.106 / 77.548 / 283.1
PSTNN [16] | – | – | – / 22.401 / 77.953 / 291.09 | – / 14.848 / 66.132 / 441.7 | – / 24.573 / 71.988 / 352.61
MSLSTIPT [17] | – | – | – / 10.302 / 82.128 / 11,310.02 | – / 8.341 / 47.399 / 881.02 | – / 11.432 / 79.027 / 1524.004
ISNet [21] | 0.966 | 30.618 | 19 / 70.491 / 95.057 / 6.798 | 24 / 81.236 / 97.778 / 0.634 | 13 / 61.852 / 90.236 / 3.156
DNA–Net [20] | 4.697 | 14.261 | 21 / 76.169 / 97.338 / 1.454 | 33 / 93.331 / 98.672 / 0.549 | 12 / 65.466 / 94.276 / 1.615
UIU–Net [6] | 50.540 | 54.426 | 16 / 76.187 / 95.057 / 1.077 | 32 / 92.393 / 97.989 / 0.356 | 11 / 64.200 / 89.226 / 2.517
AGPCNet [23] | 12.360 | 43.181 | 14 / 69.730 / 94.677 / 1.604 | 15 / 73.910 / 97.672 / 2.321 | 14 / 61.382 / 84.014 / 2.057
MTU–Net [22] | 8.221 | 99.437 | 10 / 69.081 / 97.719 / 3.500 | 15 / 79.024 / 97.884 / 2.874 | 9 / 61.401 / 90.416 / 2.874
MSHNet [24] | 4.065 | 6.110 | 31 / 75.116 / 92.015 / 2.257 | 47 / 85.416 / 97.566 / 1.841 | 24 / 64.268 / 90.219 / 1.845
SCTransNet [28] | 11.190 | 10.120 | 9 / 76.277 / 96.192 / 2.046 | 13 / 93.472 / 98.280 / 0.625 | 8 / 66.382 / 91.527 / 1.015
BGM [29] | 4.076 | 6.773 | 32 / 75.372 / 95.502 / 4.175 | 50 / 81.262 / 95.793 / 1.225 | 26 / 63.718 / 90.498 / 1.896
HFMNet [30] | 6.090 | 11.190 | 10 / 74.427 / 95.762 / 5.991 | 17 / 87.729 / 96.957 / 3.617 | 10 / 66.518 / 91.726 / 1.066
WSNet (Ours) | 0.054 | 1.050 | 119 / 76.929 / 96.578 / 2.751 | 146 / 91.194 / 98.730 / 0.432 | 60 / 64.467 / 90.572 / 2.725
Table 2. Quantitative results of various lightweight models. Params (M), FLOPs (G), FPS, IoU (%), Pd (%), and Fa (×10⁻⁵) are reported on the three datasets. The best results are in bold.

Method | Params (M) | FLOPs (G) | SIRST (FPS / IoU / Pd / Fa) | NUDT–SIRST (FPS / IoU / Pd / Fa) | IRSTD–1K (FPS / IoU / Pd / Fa)
ACM [19] | 0.398 | 0.402 | 68 / 67.942 / 91.255 / 2.957 | 72 / 68.902 / 95.979 / 1.342 | 43 / 62.281 / 90.909 / 4.648
ALCNet [58] | 0.427 | 0.378 | 56 / 67.437 / 90.494 / 2.511 | 83 / 63.917 / 94.603 / 2.188 | 47 / 60.530 / 90.236 / 4.097
ISNet [21] | 0.966 | 30.618 | 19 / 70.491 / 95.057 / 6.798 | 24 / 81.236 / 97.778 / 0.634 | 13 / 61.852 / 90.236 / 3.156
RDIAN [53] | 0.217 | 3.718 | 83 / 69.190 / 92.395 / 3.416 | 97 / 78.639 / 96.508 / 1.744 | 42 / 60.613 / 87.205 / 3.004
LW–IRSTNet [46] | 0.163 | 0.304 | 50 / 68.340 / 95.057 / 1.476 | 57 / 61.411 / 91.640 / 3.481 | 51 / 51.094 / 84.014 / 0.432
RepISD [45] | 0.309 | 7.052 | 76 / 73.821 / 95.057 / 5.310 | 96 / 91.984 / 98.201 / 0.699 | 45 / 60.357 / 94.613 / 7.755
HFMNet–Tiny [30] | 1.540 | 2.900 | 26 / 74.611 / 93.869 / 6.552 | 39 / 85.402 / 94.441 / 5.779 | 21 / 64.094 / 89.967 / 1.950
WSNet (Ours) | 0.054 | 1.050 | 119 / 76.929 / 96.578 / 2.751 | 146 / 91.194 / 98.730 / 0.432 | 60 / 64.467 / 90.572 / 2.725
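The IoU, Pd, and Fa values in Tables 1 and 2 (and in the ablations below) follow the conventions of the IRSTD literature: pixel–level intersection over union, target–level probability of detection, and pixel–level false–alarm rate. The sketch below illustrates one common way to compute them; the matching rule (a centroid–distance tolerance, here 3 pixels) and the libraries used are assumptions, not the authors’ released evaluation code.

```python
# Illustrative IRSTD metrics (IoU, Pd, Fa); the centroid tolerance and the
# exact matching rule are assumptions, not the paper's official protocol.
import numpy as np
from scipy import ndimage

def irstd_metrics(pred, gt, dist_thresh=3.0):
    """pred, gt: binary masks of shape (H, W). Returns (IoU, Pd, Fa)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)

    # Pixel-level IoU.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union > 0 else 0.0

    # Target-level Pd: a ground-truth target counts as detected if some
    # predicted component's centroid lies within dist_thresh pixels of its own.
    gt_lab, n_gt = ndimage.label(gt)
    pr_lab, n_pr = ndimage.label(pred)
    gt_cents = ndimage.center_of_mass(gt.astype(float), gt_lab, range(1, n_gt + 1))
    pr_cents = ndimage.center_of_mass(pred.astype(float), pr_lab, range(1, n_pr + 1))
    detected, matched_pr = 0, set()
    for gc in gt_cents:
        for j, pc in enumerate(pr_cents):
            if j in matched_pr:
                continue
            if np.hypot(gc[0] - pc[0], gc[1] - pc[1]) <= dist_thresh:
                detected += 1
                matched_pr.add(j)
                break
    pd = detected / n_gt if n_gt > 0 else 1.0

    # Pixel-level Fa: pixels of unmatched predicted components over all pixels.
    fa_pixels = 0
    for lab in range(1, n_pr + 1):
        if (lab - 1) not in matched_pr:
            fa_pixels += (pr_lab == lab).sum()
    fa = fa_pixels / pred.size
    return iou, pd, fa
```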
Table 3. Inference speed comparison evaluated on the SIRST dataset using an Intel Xeon Platinum 8352V CPU.

Method | DNA–Net | MSHNet | SCTransNet | RepISD | HFMNet–Tiny | WSNet
FPS | 5 | 7 | 2 | 14 | 6 | 30
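Table 3 reports pure CPU throughput. As a rough guide to how such numbers are typically obtained (this is not the authors’ benchmarking script; the model handle, input size, and repetition counts are placeholders), one can time repeated forward passes after a warm-up:

```python
# Illustrative CPU throughput measurement; model, input size, and counts
# are placeholders rather than the paper's benchmarking setup.
import time
import torch

def measure_fps(model, input_size=(1, 1, 256, 256), warmup=10, iters=100):
    model.eval()
    x = torch.randn(*input_size)          # dummy single-channel infrared frame
    with torch.no_grad():
        for _ in range(warmup):           # warm-up to exclude one-time costs
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return iters / elapsed                # frames per second

# Example: fps = measure_fps(WSNet())     # assumes a WSNet module is available
```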
Table 4. Results obtained by stacking different numbers of ResBlocks. The best results are in bold.

ResBlocks | SIRST (IoU / Pd / Fa) | NUDT–SIRST (IoU / Pd / Fa) | IRSTD–1K (IoU / Pd / Fa)
1 | 73.606 / 93.536 / 3.451 | 85.027 / 97.884 / 1.539 | 60.192 / 86.195 / 1.587
2 | 75.005 / 95.060 / 3.598 | 87.906 / 98.942 / 1.709 | 63.459 / 89.521 / 2.881
3 | 75.236 / 95.437 / 2.380 | 87.244 / 98.518 / 1.974 | 62.117 / 88.215 / 2.243
4 | 75.643 / 96.198 / 3.629 | 86.898 / 98.941 / 3.345 | 61.789 / 87.542 / 1.245
5 | 73.270 / 92.776 / 4.157 | – / – / – | – / – / –
6 | 72.220 / 94.676 / 5.858 | – / – / – | – / – / –
Table 5. Ablation study of the WEM and CSHA in IoU (%), Pd (%), and Fa (×10⁻⁵) on different datasets.

Method | SIRST (IoU / Pd / Fa) | NUDT–SIRST (IoU / Pd / Fa) | IRSTD–1K (IoU / Pd / Fa)
w/o WEM + CSHA | 51.494 / 85.932 / 12.629 | 69.052 / 87.196 / 11.635 | 38.742 / 80.808 / 15.500
w/o WEM | 66.561 / 91.255 / 5.859 | 82.230 / 97.519 / 3.152 | 47.251 / 68.618 / 72.175
w/o CSHA | 66.532 / 92.395 / 7.148 | 85.421 / 96.085 / 5.113 | 57.635 / 89.562 / 6.859
WSNet | 76.929 / 96.578 / 2.751 | 91.194 / 98.730 / 0.432 | 64.467 / 90.572 / 2.725
Table 6. Ablation study on different numbers of WEM blocks in IoU (%), Pd (%), and Fa (×10⁻⁵) on different datasets. The configuration with six blocks (matching the final WSNet results) is the standard WEM.

Number | SIRST (IoU / Pd / Fa) | NUDT–SIRST (IoU / Pd / Fa) | IRSTD–1K (IoU / Pd / Fa)
0 | 66.561 / 91.255 / 5.859 | 82.230 / 97.519 / 3.152 | 47.251 / 68.618 / 72.175
1 | 67.903 / 95.817 / 7.086 | 84.596 / 98.624 / 2.847 | 48.748 / 89.226 / 13.928
2 | 71.544 / 96.958 / 4.315 | 85.487 / 98.836 / 1.534 | 57.215 / 81.481 / 1.892
3 | 73.237 / 96.198 / 3.382 | 86.104 / 98.307 / 1.021 | 58.775 / 84.175 / 1.353
4 | 75.824 / 97.338 / 2.586 | 88.625 / 98.624 / 0.845 | 59.640 / 86.195 / 1.539
5 | 75.712 / 95.635 / 2.720 | 90.012 / 98.519 / 0.622 | 62.272 / 88.215 / 2.422
6 | 76.929 / 96.578 / 2.751 | 91.194 / 98.730 / 0.432 | 64.467 / 90.572 / 2.725
7 | 75.842 / 93.550 / 3.023 | 90.352 / 98.112 / 0.912 | 63.742 / 88.409 / 2.450
8 | 75.224 / 95.621 / 3.200 | 90.082 / 96.992 / 1.020 | 62.875 / 89.710 / 2.818
Table 7. Ablation study on the design choice and the parameter p selection of CSHA in IoU (%), Pd (%), and Fa (×10⁻⁵) on different datasets.

Type | SIRST (IoU / Pd / Fa) | NUDT–SIRST (IoU / Pd / Fa) | IRSTD–1K (IoU / Pd / Fa)
baseline | 66.532 / 92.395 / 7.148 | 85.421 / 96.085 / 5.113 | 57.635 / 89.562 / 6.859
+ CA (SENet [55]) | 68.652 / 93.916 / 7.265 | 88.986 / 98.730 / 1.847 | 59.173 / 85.859 / 1.961
+ SA | 68.228 / 93.536 / 5.838 | 84.861 / 96.402 / 2.195 | 59.490 / 91.246 / 4.813
+ CBAM [32] | 73.279 / 93.536 / 4.123 | 89.856 / 98.730 / 1.081 | 62.181 / 86.195 / 1.959
+ CSHA (Ours) | 76.929 / 96.578 / 2.751 | 91.194 / 98.730 / 0.432 | 64.467 / 90.572 / 2.725
p = 1 | 73.473 / 94.704 / 4.233 | 90.142 / 98.116 / 0.824 | 63.072 / 90.351 / 3.893
p = 2 | 76.929 / 96.578 / 2.751 | 91.194 / 98.730 / 0.432 | 64.467 / 90.572 / 2.725
p = 3 | 74.679 / 96.192 / 3.424 | 89.758 / 96.667 / 0.936 | 63.861 / 90.385 / 3.367
p = 4 | 73.465 / 94.252 / 4.794 | 88.743 / 95.975 / 1.269 | 62.460 / 88.729 / 3.479
Table 8. Ablation study on the design choices for low–level and high–level feature fusion and for WEM’s multi–scale feature fusion in IoU (%), Pd (%), and Fa (×10⁻⁵) on different datasets. The summation–based skip connection and the summation–based WEM fusion (matching the final WSNet results) are the original design choices of WSNet.

Type | SIRST (IoU / Pd / Fa) | NUDT–SIRST (IoU / Pd / Fa) | IRSTD–1K (IoU / Pd / Fa)
Low–level and high–level feature fusion choice:
AFF–ResBlock [60] | 76.810 / 95.967 / 2.031 | 91.364 / 98.389 / 0.367 | 64.251 / 90.176 / 2.456
iAFF–ResBlock [60] | 75.802 / 95.143 / 2.769 | 90.730 / 98.107 / 0.958 | 64.002 / 89.793 / 2.675
skip connection (concatenation) | 74.672 / 94.446 / 3.110 | 90.497 / 97.142 / 1.097 | 62.501 / 88.775 / 2.729
skip connection (summation) | 76.929 / 96.578 / 2.751 | 91.194 / 98.730 / 0.432 | 64.467 / 90.572 / 2.725
WEM’s multi–scale feature fusion choice:
concatenation | 75.619 / 94.160 / 3.493 | 90.223 / 98.712 / 1.174 | 63.290 / 89.990 / 2.853
summation | 76.929 / 96.578 / 2.751 | 91.194 / 98.730 / 0.432 | 64.467 / 90.572 / 2.725
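As a minimal illustration of the two fusion options compared in Table 8 (the 1×1 projection after concatenation is an assumption, added only so that both variants return the same channel count):

```python
# Minimal contrast between the two fusion options in Table 8; the 1x1
# projection after concatenation is an assumed choice to restore channels.
import torch
import torch.nn as nn

def fuse_sum(low, high):
    return low + high                      # summation: no extra parameters

class FuseConcat(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, low, high):
        return self.proj(torch.cat([low, high], dim=1))  # concatenation + 1x1 conv
```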