Article

DSAD: Multi-Directional Contrast Spatial Attention-Driven Feature Distillation for Infrared Small Target Detection

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(20), 3466; https://doi.org/10.3390/rs17203466
Submission received: 24 July 2025 / Revised: 6 October 2025 / Accepted: 12 October 2025 / Published: 17 October 2025
(This article belongs to the Special Issue Deep Learning-Based Small-Target Detection in Remote Sensing)

Highlights

What are the main findings?
  • Infrared small targets exhibit weak responses and are easily submerged by strong background noise. Moreover, existing distillation methods have insufficient ability to capture the spatial information of small targets, leading student networks to pay more attention to unrelated background while ignoring the transfer of small-target representation knowledge.
  • There are significant feature differences and representational gaps in hierarchical features between teacher and student networks. Forcing feature matching could weaken the learning ability of student networks for small targets.
What is the implication of the main finding?
  • We propose a Multi-Directional Contrast Spatial Attention (DSA) mechanism for IRSTD. The DSA module can capture small target spatial features across eight discrete directions in a parameter-free manner, thereby enhancing detection performance without increasing computational cost.
  • We design a Gaussian transformation of features to leverage feature discrepancies between student and teacher networks, and integrate spatial weights derived from the DSA module to design the Perceptual Weighted Mean Square Error (PWMSE) distillation loss, enhancing the efficient transfer of small target feature representations. Our DSAD method achieves promising results and even exhibits detection performance (e.g., IoU, Pd) comparable to the teacher networks.

Abstract

Recent deep learning methods have achieved promising performance in infrared small target detection (IRSTD) but at high computational cost, limiting deployment and operation in resource-limited scenarios. There is an urgent need to develop lightweight yet high-precision model compression methods. In this paper, we propose a Multi-Directional Contrast Spatial Attention-driven Feature Distillation (DSAD) method for fast and high-performance IRSTD. Specifically, we first extract feature maps from teacher and student networks. Then, a standard Gaussian transformation is adopted to eliminate magnitude effects. After that, a Multi-Directional Contrast Spatial Attention (DSA) module is designed to capture multi-directional spatial information from teacher features, which makes student networks pay more attention to small target areas while suppressing background. Finally, we propose a Perceptual Weighted Mean Square Error (PWMSE) distillation loss by combining the DSA with feature discrepancies, guiding student networks to learn more effective information from small target features. Experimental results on two benchmark datasets (NUDT-SIRST and NUAA-SIRST) demonstrate that our distillation method achieves detection performance comparable to the teacher counterparts on several benchmark IRSTD networks (e.g., DNANet, AMFU-Net, and DMFNet) and delivers consistent gains in inference speed (over 2×) on edge devices (NVIDIA AGX and HUAWEI Ascend-310B).

1. Introduction

Infrared small target detection (IRSTD) aims at separating small targets from complex backgrounds [1] and has been widely applied in many scenes such as military early warning [2], optical surveillance, and security facilities [3]. With the development of deep learning, more and more convolutional neural networks have achieved remarkable advantages through pixel-level segmentation [4]. However, high-performance IRSTD models often adopt massive connections and complex structures [5], resulting in models with millions of parameters and billions of floating-point operations [6]. For this reason, these complex models are unsuitable for resource-limited scenarios (e.g., micro-satellites, unmanned aerial vehicles) [7]. This dilemma forces us to resolve a critical challenge in IRSTD: designing a network that achieves both fast inference and high detection performance.
With the increasing demand for real-time inference in real scenarios, lightweight networks have attracted extensive attention due to their small sizes and low computational costs. Despite the proposal of various efficient modules, lightweight networks still struggle to match the accuracy of large-scale networks. In recent years, model compression [8] has become an essential step toward edge deployment, including pruning [9], low-rank decomposition [10], quantization [11], and knowledge distillation [12]. The former three types of model compression methods primarily focus on simplifying network architectures without considering knowledge transfer [4]. Knowledge distillation (KD) [12] is a widely used compression method that is generally considered able to achieve an effective balance between computational cost and accuracy, as shown in Figure 1. Knowledge distillation operates by transferring useful information from large, high-performance networks to optimize the training of lightweight student networks.
Although KD has demonstrated remarkable effectiveness in general vision tasks [4], the extreme characteristics of small targets (e.g., small size, weak shape and texture) [13] mean that its direct application to IRSTD may yield suboptimal outcomes. The main reasons are as follows: (1) Inefficient knowledge transfer of small targets: Small targets produce weak responses that are easily submerged by strong background noise, resulting in information diffusion in important target regions. Existing methods [14,15,16] mainly rely on feature matching between teacher and student networks while lacking discrimination of spatial information; consequently, they struggle to distinguish weak target features from complex background interference, leading student networks to pay more attention to unrelated background information and ignore the transfer of small-target representation knowledge. (2) Hierarchical feature representation mismatch: The superior capacity of teacher networks is reflected in their highly discriminative representations of small targets. The student network is easily constrained by inherent structural limitations (e.g., shallower depth, fewer channels), generally resulting in a representational gap in hierarchical features compared to the teacher network. Existing methods [16,17,18] perform strict feature imitation, which may force the student network to conform to incompatible high-level semantic representations. This representation mismatch weakens the ability of the student network to learn the discriminative features of small targets.
To address these challenges, we propose a Multi-Directional Contrast Spatial Attention-driven Feature Distillation (DSAD) method, as illustrated in Figure 2. First, teacher and student feature maps are selected and spatially aligned. Then, both features are transformed into a standard Gaussian distribution along the channel and spatial dimensions to eliminate magnitude disparities, thereby focusing on feature distribution patterns and relational semantics. To more effectively exploit the spatial information of feature maps, we design a Multi-Directional Contrast Spatial Attention (DSA) module. Inspired by previous methods [19,20], the DSA begins with average pooling and max pooling of the input teacher feature map for channel aggregation. After that, eight discrete directional convolution kernels are employed to extract directional features. Local contrasts are computed for each direction, and the minimum contrast across directions at each spatial position is regarded as the spatial weight, resulting in a single-channel spatial attention map. Finally, using the feature discrepancies between teacher and student networks alongside the DSA-derived spatial weights, we design a Perceptual Weighted Mean Square Error (PWMSE) loss, which effectively enhances small target areas and suppresses background clutter. The DSAD method guides the lightweight network to focus on informative areas and target features, improving detection performance without sacrificing efficiency.
To the best of our knowledge, this is the first paper to both perform knowledge distillation for IRSTD networks and deploy the distilled models on diversified edge hardware (e.g., NVIDIA AGX, HUAWEI Ascend-310B). The contributions of this study can be summarized as follows:
1.
This paper proposes a Multi-Directional Contrast Spatial Attention-driven Feature Distillation (DSAD) method for IRSTD. Our method leverages feature discrepancies between student and teacher networks, and integrates spatial weights derived from the DSA module to design the PWMSE distillation loss, thereby achieving efficient transmission of small target feature representations.
2.
We introduce a Multi-Directional Contrast Spatial Attention (DSA) mechanism for IRSTD. The DSA module can capture small target spatial features across eight discrete directions in a parameter-free manner, thereby enhancing detection performance without increasing computational cost.
3.
The experimental results demonstrate that our DSAD achieves promising results and even achieves detection performance (e.g., $IoU$, $P_d$) comparable to the teacher network. The inference latency of the student networks achieves over 2× acceleration on the NVIDIA AGX and HUAWEI Ascend-310B.

2. Related Work

In this section, we briefly review the major works in IRSTD and knowledge distillation methods.

2.1. Infrared Small Target Detection

In recent years, IRSTD has been the focus of extensive research, which can be divided into model-driven traditional methods [21,22] and data-driven deep learning methods [2,23,24]. Deep learning methods have demonstrated powerful model fitting capabilities, particularly in handling real complex scenarios characterized by dramatic variations in target size, shape, and cluttered backgrounds.
Some researchers [25,26,27] have utilized deep learning methods to explore various high-performance IRSTD networks. ALCNet [28], proposed by Dai et al., reformulated the traditional LCM into a parameter-free nonlinear feature refinement layer. To maintain the deep characteristic responses of small targets, Li et al. [2] proposed DNANet, a network architecture specifically designed for IRSTD that preserves small target responses in deep CNN layers, but it lacked dynamic adaptability to targets. UIUNet [29] integrated a small U-Net [30] into a large U-Net backbone, thereby enabling multi-level and multi-scale representation learning. After that, the attention-guided pyramid context association network AGPCNet [31] computed local feature maps before global associations to enhance feature utilization, but it may overlook certain local information. AMFU-Net [23] used U-Net3+ [32] full-scale connection features to enhance the feature-extraction ability through multi-scale feature fusion. However, the differences in contributions among features of different scales remain insignificant. Zhang et al. [33] designed FC3Net, a high-capacity network; in its encoder–decoder structure, multi-layer feature compensation modules and cross-layer feature-association modules were embedded to achieve fine segmentation. IRPruneDet [34] is the first work to introduce pruning into IRSTD networks. By leveraging the wavelet transform to evaluate the importance of network components and adopting a soft channel reconstruction strategy, it achieves highly efficient detection performance, but its operational efficiency requires further validation. To achieve background suppression and target localization capabilities, GCI-Net [35] employed Gaussian curvature branches and complementary group attention to form a multi-branch collaborative framework, yet suffered from difficult collaborative optimization across branches. To effectively capture target features, DMFNet [24] utilizes a dual encoder and integrates three attention modules to enhance detection performance, yet the integration of multiple modules results in increased structural complexity. Zhang et al. [6] recently proposed the IRSAM model, which adapts the Segment Anything Model (SAM) for IRSTD. IRSAM employed a Wavelet Progressive Multi-scale Diffusion (WPMD) module to suppress noise while preserving target features, and integrated a perceptual encoder to fuse multi-scale features, leading to enhanced detection performance.
Although CNN-based IRSTD networks have shown impressive performance, most of them rely on large-scale complex connections or stacked modules [2,23,24], which inevitably increases computational cost (e.g., Params and FLOPs) and inference latency [4]. This introduces serious limitations in practical applications with limited resources.

2.2. Knowledge Distillation

Knowledge distillation is a model training strategy that enhances the performance of lightweight student models through knowledge transfer from high-performance teacher models [12]. KD methods are broadly categorized into logit distillation and feature distillation, as shown in Figure 3.
Early research on knowledge distillation (KD) focused on image classification and was then gradually extended to dense prediction tasks (e.g., object detection, semantic segmentation) [36]. To improve imitation capability, FitNet, proposed by Romero et al. [37], pioneered intermediate feature regression to guide student representation learning, extending distillation beyond output layers. Huang et al. [38] proposed Neural Selection Transfer (NST), which introduced maximum mean discrepancy to align neuronal selectivity distributions, establishing an implicit supervised learning mechanism, but it is susceptible to feature deviations. Zagoruyko et al. [17] proposed Attention Transfer (AT), which advanced spatial attention mask construction through aggregated feature channel transfer, but it was unable to fully decouple different attentions. KDSVD [18] integrated Singular Value Decomposition (SVD) for compressing teacher features and Radial Basis Functions (RBFs) for feature correlation measurement. To promote instance-level feature matching and transfer structural relationships between samples, Park et al. [39] introduced Relation Knowledge Distillation (RKD), which constrained the differences between teacher and student samples through a distance loss and the consistency of spatial angles through an angle loss; however, it depends heavily on how the relations are constructed. Peng et al. proposed Correlation Consistent Knowledge Distillation (CCKD) [40], which further applied RBFs to capture inter-sample correlations and used Taylor series expansion to approximate kernel functions; however, it relies heavily on the selection of kernel functions. To enhance the integrity of features, Yang et al. [14] proposed Masked Generative Distillation (MGD), which incorporates a generative task that randomly masks student features and guides the reconstruction of the complete teacher features to learn semantic associations, though this may hinder accurate learning of feature correlations. ST-KDNet [41] implemented semantic knowledge transfer through a cross-stage KD loss. Regarding feature differences in heterogeneous detection models, Pearson Knowledge Distillation (PKD) [15] introduced Pearson correlation coefficients to eliminate magnitude mismatch issues but exhibits limited effectiveness in handling features with non-linear relationships. To resolve conflicts between detection outputs and labels, CrossKD [16] generates cross-head predictions and mimics teacher predictions to enable heterogeneous detector adaptation, though it relies on hyperparameter selection.
Although promising progress has been achieved, most distillation methods focus on general visual tasks, and research on knowledge distillation for infrared small targets is still at an early stage. Chen et al. [42] proposed a feature-based infrared small target distillation (IRKD) method, which utilizes Unified Channel–Spatial Attention (UCSA) to highlight feature differences. However, the distillation process is susceptible to interference from irrelevant information. Tang et al. [3] proposed a knowledge distillation-based infrared small target detection (KDD) method, which employed a feature fusion loss to achieve knowledge transfer, yet introduces additional computational overhead. The operational efficiency of these methods in resource-limited scenarios remains to be validated. This paper aims to explore a feature distillation-based method that preserves the student network's detection performance without sacrificing efficiency.

3. Methodology

In this section, we introduce our DSAD method in detail.

3.1. Motivation

For lightweight IRSTD networks, we observe that their inherent structural constraints limit detection performance compared to advanced, larger-parameter networks. To achieve faster detection without sacrificing detection performance, the DSAD method introduces a Multi-Directional Contrast Spatial Attention module that effectively extracts critical information from teacher output features. Moreover, a Gaussian distribution strategy is designed for feature alignment to eliminate magnitude discrepancies between teacher and student networks. DSAD enables precise matching and transfer of small target features while suppressing the negative impact of background noise, thereby efficiently guiding the student network to learn informative features.

3.2. Features Selection

Considering the characteristics of IRSTD networks, we select critical intermediate features from encoder–decoder structures or specialized network hierarchies. These cross-layer features contain multi-scale information with rich local and global semantics, supplying rich supervisory information for knowledge distillation. Considering the different sizes of teacher and student features, we first adjust the size of the student feature maps to ensure spatial dimensional consistency. We assume that the student feature map at a specific layer is denoted as $F_S \in \mathbb{R}^{C_S \times H_S \times W_S}$, and the corresponding feature map of the teacher model is denoted as $F_T \in \mathbb{R}^{C_T \times H_T \times W_T}$. Bilinear interpolation is used to adjust the student feature map $F_S$ to match the spatial dimensions of the teacher map $F_T$, which can be expressed as follows:
$F_S = \Phi\left(F_S, (H_T, W_T)\right)$,
where $\Phi$ represents the bilinear interpolation function, which is provided by the PyTorch 2.0 library. $H_T$ and $W_T$ represent the height and width of the teacher feature map, respectively.
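As a concrete illustration of the alignment step in Equation (1), the following is a minimal PyTorch sketch; the function name and the `align_corners` setting are our own assumptions rather than part of any released implementation.

```python
import torch
import torch.nn.functional as F

def align_student_feature(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Resize the student feature map to the teacher's spatial size (Equation (1)).

    f_s: student features of shape (N, C_S, H_S, W_S)
    f_t: teacher features of shape (N, C_T, H_T, W_T)
    """
    # Bilinear interpolation Phi(F_S, (H_T, W_T)); align_corners=False is an assumption.
    return F.interpolate(f_s, size=f_t.shape[-2:], mode="bilinear", align_corners=False)
```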
After selecting the intermediate features of the teacher and student networks, the energy distribution maps of these intermediate feature maps for an input image can be observed. The differences between the teacher and student features are shown in Figure 4.

3.3. Directional Contrast Spatial Attention

To answer the question of "where the target areas are", we propose a Multi-Directional Contrast Spatial Attention (DSA) module, which is applied in the DSAD method. As shown in Figure 5, the DSA module captures feature responses in eight discrete principal directions and computes response contrasts in these directions, effectively enhancing the feature representation of target areas while suppressing irrelevant background information. By using DSA to extract the spatial information from teacher features, it enables refined alignment of critical positional cues, thereby providing more precise guidance for student network learning.

3.3.1. Spatial Aggregation

Pooling operations along the channel dimension (e.g., average pooling and max pooling) have proven effective in aggregating spatial saliency information [19]. Specifically, for an input feature map, we first apply global average pooling and max pooling along the channel dimension in parallel. This process results in two complementary 2D feature maps ($F_{avg} \in \mathbb{R}^{B \times 1 \times H \times W}$ and $F_{max} \in \mathbb{R}^{B \times 1 \times H \times W}$) that capture global spatial statistics across channels. Subsequently, through the channel fusion strategy, these two complementary feature maps are integrated to generate a compact spatial feature descriptor $F_{cat}$ that encodes the salient structural information of the input feature map.
$F_{cat} = [\mathrm{avgpool}(F); \mathrm{maxpool}(F)] = \mathrm{cat}\left(F_{avg}, F_{max}\right) \in \mathbb{R}^{B \times 2 \times H \times W}$,
where $\mathrm{cat}(\cdot)$ represents the concatenation of multiple feature maps along the channel dimension, and $\mathrm{avgpool}(\cdot)$ and $\mathrm{maxpool}(\cdot)$ represent global average pooling and max pooling along the channel dimension in parallel, respectively.
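A minimal PyTorch sketch of the channel aggregation in Equation (2) is given below; the function name is hypothetical.

```python
import torch

def channel_aggregate(feat: torch.Tensor) -> torch.Tensor:
    """Aggregate an (N, C, H, W) feature map into F_cat of shape (N, 2, H, W) (Equation (2))."""
    f_avg = feat.mean(dim=1, keepdim=True)        # average pooling along the channel dimension
    f_max = feat.max(dim=1, keepdim=True).values  # max pooling along the channel dimension
    return torch.cat([f_avg, f_max], dim=1)       # F_cat = cat(F_avg, F_max)
```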

3.3.2. Directional Convolution Kernel

Inspired by the design of the CBAM module [19], which employs a broader receptive field to capture spatial areas, we adopt a similar approach with a 7 × 7 receptive field. However, the standard 7 × 7 convolution is isotropic and lacks directional sensitivity, limiting its ability to capture orientation variations in small target structures. To address this, we propose eight discrete directional convolution kernels that imitate human visual sensitivity to oriented patterns, thereby capturing multi-directional spatial features.
To achieve an expanded 7 × 7 receptive field without increasing the parameter count, we expand the base 3 × 3 kernel to 7 × 7 using bilinear interpolation. This design maintains the parameter efficiency of 3 × 3 kernels while enabling 7 × 7-scale spatial feature extraction.
To make the convolution kernels aware of spatial direction, we add discrete directional binary masks to the standard convolution operations. Specifically, we construct eight 7 × 7 binary directional masks that cover eight discrete directions $d_i, i \in \{1, 2, \dots, 8\}$. These directions divide the image gradient space into eight 45° sectors, covering the positive directions $\{0°, 45°, 90°, 135°\}$ and the negative directions $\{180°, 225°, 270°, 315°\}$. These eight directions are further grouped into four orthogonal directions $\{od_1, od_2, od_3, od_4\}$, where the positive and negative directions in each orthogonal direction are separated by 180°. From the kernel center, three consecutive pixels along the positive direction are assigned the value 1, while those in the inverse direction are assigned the value −1. These predefined masks $m_{d_i}, i \in \{1, 2, \dots, 8\}$ are then element-wise multiplied with the base kernel $W_{base}^{7 \times 7}$ to generate eight direction-specific convolution kernels $\{k_{d_1}, k_{d_2}, \dots, k_{d_8}\}$, which can be represented as:
$k_{d_i} = m_{d_i} \odot W_{base}^{7 \times 7}, \quad i \in \{1, 2, \dots, 8\}$,
where ⊙ represents element-wise multiplication.
Figure 6 provides a detailed illustration of the multi-directional convolution kernels. These directional kernels $k_{d_i}$ are constrained by the binary masks: the weights are retained only in their designated directions and set to zero elsewhere. This design introduces anisotropic kernels that selectively respond to localized intensity gradients aligned with specific orientations (e.g., target–background transition areas). This selective excitation mechanism enhances sensitivity to directional features without introducing extra complexity.
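To make the kernel construction concrete, the sketch below builds the eight masked kernels of Equation (3) under one plausible reading of the mask definition (each mask marks three consecutive pixels along its own direction, with +1 for the four positive directions and −1 for their inverses); the direction ordering, function name, and `align_corners` choice are assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

# Unit steps (dy, dx) for the eight discrete directions:
# positive {0°, 45°, 90°, 135°} followed by their inverses {180°, 225°, 270°, 315°}.
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]

def build_directional_kernels(base_3x3: torch.Tensor) -> torch.Tensor:
    """Return eight 7x7 direction-specific kernels k_{d_i} (Equation (3)).

    base_3x3: base kernel of shape (1, 1, 3, 3); bilinear upsampling to 7x7
    keeps the parameter budget of a 3x3 kernel.
    """
    w_base = F.interpolate(base_3x3, size=(7, 7), mode="bilinear", align_corners=True)
    kernels, center = [], 3
    for i, (dy, dx) in enumerate(DIRECTIONS):
        mask = torch.zeros(1, 1, 7, 7)
        sign = 1.0 if i < 4 else -1.0        # +1 for positive directions, -1 for inverses
        for step in (1, 2, 3):               # three consecutive pixels from the kernel center
            mask[..., center + step * dy, center + step * dx] = sign
        kernels.append(w_base * mask)        # element-wise masking: Equation (3)
    return torch.cat(kernels, dim=0)         # (8, 1, 7, 7), one kernel per direction
```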

3.3.3. Directional Feature Contrast

Utilizing the designed directional convolution kernels, a pixel-wise convolution is applied to the spatially aggregated features $F_{cat}$. This process generates eight direction-specific feature response maps $F_{d_i}$, as formulated in Equation (4):
$F_{d_i} = F_{cat} \ast k_{d_i}, \quad i \in \{1, 2, \dots, 8\}$,
where $\ast$ denotes the convolution operation and $k_{d_i}, i \in \{1, 2, \dots, 8\}$ represents the convolution kernel of each of the eight distinct directions.
These eight directional feature maps represent the spatial distribution of feature intensity for orientation-specific patterns. Leveraging the intensity contrast between small targets and the surrounding background, we further compute contrast measures across four pairs of orthogonal directions in a cross-shaped pattern. Specifically, for each orthogonal direction pair, we square the difference between the feature responses of the positive and negative orientations. The resulting values quantify the contrast in that particular direction, generating the corresponding spatial contrast maps. This multi-directional contrast is adopted to distinguish spatial differences.
$C_i = \left(F_{d_i} - F_{d_{(i+4)}}\right)^2, \quad i \in \{1, 2, 3, 4\}$,
where $C_i \in \mathbb{R}^{1 \times H \times W}$ represents the spatial contrast map of the feature map in each pair of orthogonal directions.
High contrast values indicate potential salient edge or texture areas, thereby highlighting small target areas. Conversely, low contrast areas may represent isotropic background regions with minimal feature differentiation. Crucially, small target areas maintain consistently high contrast across multi-directional observations, whereas background noise or spurious features often appear asymmetric, with high contrast in some directions and diminished contrast in others.

3.3.4. Minimum Contrast Operation

Inspired by the bucket effect, the weight of each area is evaluated using the minimum contrast response across multiple directions. Spatial areas obtain high saliency weights only when they exhibit high-intensity contrast responses across all directions. Specifically, we first extract contrast intensity maps along the four primary directions. Then, for each spatial location, the minimal contrast value across all directions is selected via a channel-wise operation, resulting in a single-channel spatial weight map $F_{C_{\min}} \in \mathbb{R}^{1 \times H \times W}$, where each pixel's value retains the weakest directional contrast response for its corresponding region. This can be represented as follows:
$F_{C_{\min}}(x, y) = \min_{i \in \{1, 2, 3, 4\}} C_i(x, y), \quad \forall (x, y) \in (H, W)$,
where $H$ and $W$ represent the height and width spatial dimensions, respectively, and $C_i$ represents the spatial contrast map in each pair of orthogonal directions.
To further enhance the spatial attention representation, we apply the sigmoid activation function to the obtained spatial weight map $F_{C_{\min}}$, producing the final spatial attention weight distribution, which can be represented as Equation (7):
$M_{DSA}(F) = \mathrm{sigmoid}\left(F_{C_{\min}}\right)$,
where $F$ represents the input feature map, $M_{DSA}(F)$ represents the derived spatial attention weights of the DSA module, and $\mathrm{sigmoid}(\cdot)$ represents the sigmoid activation function.
The resultant weight map encodes spatial saliency information: high-weight areas correspond to critical small targets, while low weights effectively suppress background noise. The minimum-value operation guides the network to concentrate precisely on small target areas, thereby improving detection performance through enhanced spatial awareness.
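The DSA pipeline of Equations (4)–(7) can be sketched in PyTorch as follows; how the two aggregated channels are combined during the directional convolution is not fully specified in the text, so replicating each 7 × 7 kernel across both channels is our assumption.

```python
import torch
import torch.nn.functional as F

def dsa_attention(feat_t: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """Compute the DSA spatial attention map M_DSA(F) (Equations (4)-(7)).

    feat_t:  teacher feature map of shape (N, C, H, W)
    kernels: eight 7x7 directional kernels of shape (8, 1, 7, 7),
             e.g., from build_directional_kernels() above.
    """
    # Channel aggregation (Equation (2)): (N, 2, H, W).
    f_cat = torch.cat([feat_t.mean(1, keepdim=True),
                       feat_t.max(1, keepdim=True).values], dim=1)
    # Directional responses (Equation (4)); each kernel is replicated over the
    # two aggregated channels and summed, giving one response map per direction.
    f_dir = F.conv2d(f_cat, kernels.repeat(1, 2, 1, 1), padding=3)  # (N, 8, H, W)
    # Orthogonal contrast (Equation (5)): squared difference between direction i and i+4.
    contrast = (f_dir[:, :4] - f_dir[:, 4:]) ** 2                   # (N, 4, H, W)
    # Minimum contrast across the four orthogonal pairs (Equation (6)).
    f_cmin = contrast.min(dim=1, keepdim=True).values               # (N, 1, H, W)
    # Sigmoid activation yields the final spatial attention weights (Equation (7)).
    return torch.sigmoid(f_cmin)
```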

3.4. Gaussian Transformation of Feature

Due to inherent structural differences between teacher and student networks, their feature activation magnitudes usually vary considerably. Directly aligning the feature maps imposes overly strict constraints and introduces substantial irrelevant noise [14,16], which can degrade the student network's performance after distillation.
To eliminate the effects of scale discrepancies among feature maps and enable the student to focus on the teacher’s feature distribution patterns and relative response relationships rather than absolute activation values, we implement channel-wise Gaussian transformation on feature maps. For each channel, the feature map is normalized to a distribution with zero mean and unit variance, aligning their distributions to a common scale. This approach effectively eliminates magnitude effects across features and facilitates effective measurement of feature discrepancies between teacher and student networks.
Specifically, given an input feature map $F \in \mathbb{R}^{N \times C \times H \times W}$, we reshape it into a matrix of size $(C, M)$, where $M = N \times H \times W$ represents the total number of samples unfolded from the batch and spatial dimensions, so the feature map is rewritten as $F \in \mathbb{R}^{C \times M}$. For each channel $c$, the mean $\mu_c$ and standard deviation $\sigma_c$ of the overall feature are computed as follows:
$\mu_c = \frac{1}{M} \sum_{m=1}^{M} F(c, m)$,
$\sigma_c = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(F(c, m) - \mu_c\right)^2 + \varepsilon}$,
where $F(c, m)$ represents the feature response value at the corresponding position, and $\varepsilon = 1 \times 10^{-9}$ serves as a numerical stability term.
The standardized transformation for each channel feature map converts its distribution to zero mean and unit variance:
$\hat{F}(c, m) = \frac{F(c, m) - \mu_c}{\sigma_c}$,
where $\hat{F}(c, m)$ represents the standardized feature response value at the corresponding position.
Finally, the tensor is restored to $\mathbb{R}^{N \times C \times H \times W}$ through the reverse transformation.
Given the standardized Gaussian transformation operation, denoted as $\mathrm{Gauss}(\cdot)$, we can convert the original student and teacher feature maps into normalized representations:
$F_S^{Gauss} = \mathrm{Gauss}(F_S), \quad F_T^{Gauss} = \mathrm{Gauss}(F_T)$,
where $F_S^{Gauss}$ and $F_T^{Gauss}$ represent the Gaussian-transformed feature maps of the student and teacher networks, respectively.
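The channel-wise standardization of Equations (8)–(10) corresponds to the following PyTorch sketch; the function name is hypothetical.

```python
import torch

def gauss_transform(feat: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Channel-wise standard Gaussian transformation (Equations (8)-(10)).

    feat: feature map of shape (N, C, H, W); each channel is normalized to
    zero mean and unit variance over the batch and spatial dimensions.
    """
    n, c, h, w = feat.shape
    x = feat.permute(1, 0, 2, 3).reshape(c, -1)            # reshape to (C, M), M = N*H*W
    mu = x.mean(dim=1, keepdim=True)                       # Equation (8)
    sigma = torch.sqrt(x.var(dim=1, unbiased=False, keepdim=True) + eps)  # Equation (9)
    x_hat = (x - mu) / sigma                               # Equation (10)
    return x_hat.reshape(c, n, h, w).permute(1, 0, 2, 3)   # restore (N, C, H, W)
```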
As shown in Figure 7, taking one feature map as an example, we present visualizations of energy heatmaps and energy-probability distributions for feature maps of teacher–student networks before and after standard Gaussian transformation. The standard Gaussian transformation ensures the comparability of features across different models, allowing for a sharper focus on relative structural differences.

3.5. Loss Function

To measure the feature difference between teacher and student networks, a natural approach is to directly minimize their difference. In this paper, we propose a Perceptual Weighted Mean Square Error (PWMSE) distillation loss. This distillation loss applies the spatial attention weight map from the DSA module to modulate the squared feature differences, enabling student networks to prioritize feature learning and model weight updates for small targets. Specifically, we calculate the squared differences between teacher and student features, apply the spatial attention weights obtained from the DSA mechanism, and finally aggregate the mean values across all feature maps to formulate the distillation loss. This process can be represented as follows:
$\mathrm{Loss}_{kd} = \sum_{i=1}^{n} \mathrm{Loss}_i = \sum_{i=1}^{n} \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} M_{DSA}(F_T)(x, y) \left(F_T^{Gauss}(x, y) - F_S^{Gauss}(x, y)\right)^2$,
where n represents the total number of feature maps, H and W represent the height and width spatial dimensions, respectively.
The distillation loss $\mathrm{Loss}_{kd}$ assigns higher spatial attention weights to critical target areas, thereby enabling a greater contribution of small target areas to the optimization process. During training, the total loss for the student network contains two components: the knowledge distillation loss and the detection task loss. For IRSTD, the detection task loss is typically computed using SoftIoU [2]. The total training loss for the student network can be represented as:
$\mathrm{Loss} = \alpha \, \mathrm{Loss}_{det} + \beta \, \mathrm{Loss}_{kd}$,
where α and β represent the coefficients for the detection task loss and the knowledge distillation loss, respectively.
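Putting the pieces together, a minimal sketch of the PWMSE loss (Equation (12)) and the total training loss (Equation (13)) is shown below; it reuses the hypothetical `dsa_attention` and `gauss_transform` helpers sketched earlier, and averaging the weighted squared differences over all dimensions of each feature pair is our simplification of the per-map normalization.

```python
import torch

def pwmse_loss(f_t: torch.Tensor, f_s: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """Perceptual Weighted MSE for one teacher/student feature pair of matching shape."""
    w_spatial = dsa_attention(f_t, kernels)              # DSA weights from the teacher feature
    diff = gauss_transform(f_t) - gauss_transform(f_s)   # discrepancy after Gaussian transform
    return (w_spatial * diff.pow(2)).mean()              # weighted squared error (Equation (12))

def total_loss(loss_det: torch.Tensor, feat_pairs, kernels,
               alpha: float = 1.0, beta: float = 2.0) -> torch.Tensor:
    """Total training loss (Equation (13)); alpha:beta = 1:2 as used in the experiments."""
    loss_kd = sum(pwmse_loss(f_t, f_s, kernels) for f_t, f_s in feat_pairs)
    return alpha * loss_det + beta * loss_kd
```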

4. Experiments

In this section, we first introduce the evaluation metrics and implementation details. Then, we compare the detection performance with several state-of-the-art knowledge distillation methods (e.g., CrossKD [16], PKD [15], MGD [14]) on the benchmark IRSTD models (e.g., DNANet [2], AMFU-Net [23], DMFNet [24]). After that, we present ablation studies to investigate our method. Finally, we validate the inference latency improvements achieved by the proposed method on edge devices (NVIDIA AGX and HUAWEI Ascend-310B). The code is available at https://github.com/lyh-nudt/DSAD (accessed on 29 September 2025).

4.1. Evaluation Metrics

We adopt the same evaluation metrics as previous benchmark IRSTD methods [2,23,24]. Specifically, the probability of detection ($P_d$) and false alarm rate ($F_a$) are used to evaluate target localization accuracy, the Intersection over Union ($IoU$) is used to evaluate target pixel contour accuracy, and the number of parameters (Params) and floating-point operations (FLOPs) are used to evaluate memory and computation consumption.

4.1.1. Intersection over Union

Intersection over Union ($IoU$) is a pixel-level evaluation metric. The ratio of the intersection to the union of all predicted areas and real target areas reflects the accuracy of the object shape description.
$IoU = \frac{A_{inter}}{A_{union}}$,
where $A_{inter}$ and $A_{union}$ represent the intersection areas and union areas, respectively.

4.1.2. Probability of Detection

Probability of detection ($P_d$) refers to the ratio of the number of correctly detected targets $T_{correct}$ to the number of all targets $T_{All}$, which reflects the localization accuracy of the model for correct target instances. $P_d$ is defined as follows:
$P_d = \frac{T_{correct}}{T_{All}}$.

4.1.3. False-Alarm Rate

The false alarm rate ($F_a$) represents the ratio of incorrectly detected pixels $P_{false}$ to the total number of image pixels $P_{All}$, which reflects the false alarms of the model on wrong target instances. $F_a$ is defined as follows:
$F_a = \frac{P_{false}}{P_{All}}$,
where the deviation threshold is set as 3 in this paper to maintain consistency with previous work [2].
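For reference, the three detection metrics can be computed roughly as in the sketch below; the centroid-distance matching used for $P_d$ and the pixel accounting for $F_a$ follow one common protocol and may differ in detail from the benchmark evaluation code.

```python
import numpy as np
from scipy import ndimage

def iou_pd_fa(pred: np.ndarray, gt: np.ndarray, dist_thresh: float = 3.0):
    """Compute IoU, Pd, and Fa for one binary prediction / ground-truth mask pair.

    pred, gt: binary masks of shape (H, W). Targets are matched by centroid
    distance using the 3-pixel deviation threshold mentioned above.
    """
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + 1e-9)                                  # Equation (14)

    gt_lbl, n_gt = ndimage.label(gt)
    pr_lbl, n_pr = ndimage.label(pred)
    gt_cent = ndimage.center_of_mass(gt.astype(float), gt_lbl, range(1, n_gt + 1))
    pr_cent = ndimage.center_of_mass(pred.astype(float), pr_lbl, range(1, n_pr + 1))
    matched_gt, matched_pred_px = 0, 0
    for gc in gt_cent:                                            # greedy centroid matching
        for j, pc in enumerate(pr_cent, start=1):
            if np.hypot(gc[0] - pc[0], gc[1] - pc[1]) <= dist_thresh:
                matched_gt += 1
                matched_pred_px += int((pr_lbl == j).sum())
                break
    pd = matched_gt / max(n_gt, 1)                                # Equation (15)
    fa = (pred.sum() - matched_pred_px) / pred.size               # Equation (16)
    return iou, pd, fa
```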

4.1.4. Parameters

Params refers to the total amount of all parameters in the model and is a key metric to evaluate model complexity and memory usage.

4.1.5. Floating-Point Operations

FLOPs refers to the total number of all floating-point operations performed during the computation, which is commonly used to measure computational complexity in neural networks.

4.2. Datasets and Implementation Details

In this paper, we used the benchmark NUDT-SIRST [2] and NUAA-SIRST [43] datasets for both training and testing. NUAA-SIRST is the first public dataset with high-quality images and accurately labeled infrared small targets, containing 427 images with a total of 480 instances. NUDT-SIRST consists of 1327 challenging infrared small target images; it covers a wider range of target categories, target sizes, complex backgrounds, and accurate target annotations. Each dataset is split into training and testing sets at a 1:1 ratio, following the dataset division of previous works [2,24]. Sufficient test images are crucial for evaluating real model performance, so we adopted a 1:1 train-to-test ratio for the NUAA-SIRST dataset (i.e., 213 images for training and 214 images for testing) and for the NUDT-SIRST dataset (i.e., 663 images for training and 664 images for testing). Before training, all input images are normalized and then preprocessed with standard operations such as flipping, blurring, and cropping, and the image resolution is adjusted to 256 × 256 for training.
The training settings of each benchmark network followed the settings in the original papers [2,23]. The knowledge distillation training of each student network was optimized by the Adagrad method [44] with the CosineAnnealingLR scheduler. We set the learning rate, batch size, and number of epochs to 0.05, 16, and 1500, respectively. The ratio of the task loss to the distillation loss is set as $\alpha : \beta = 1 : 2$. All experiments were implemented in PyTorch [45] on a computer with an Intel i7-13650HX CPU and an NVIDIA RTX 3090 GPU.
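A schematic training loop under these settings could look as follows; `student_net`, `teacher_net`, `train_loader`, `soft_iou_loss`, and `kernels` are placeholders for the corresponding components (the networks are assumed to return both predictions and the selected intermediate features, already spatially aligned), so the snippet sketches the configuration rather than reproducing the authors' training script.

```python
import torch
from torch.optim import Adagrad
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = Adagrad(student_net.parameters(), lr=0.05)        # learning rate 0.05
scheduler = CosineAnnealingLR(optimizer, T_max=1500)          # 1500 epochs

for epoch in range(1500):
    for images, labels in train_loader:                        # batch size 16, 256x256 inputs
        optimizer.zero_grad()
        preds, student_feats = student_net(images)
        with torch.no_grad():
            _, teacher_feats = teacher_net(images)             # frozen teacher
        loss_det = soft_iou_loss(preds, labels)                # SoftIoU detection loss [2]
        loss = total_loss(loss_det, zip(teacher_feats, student_feats),
                          kernels, alpha=1.0, beta=2.0)        # alpha:beta = 1:2
        loss.backward()
        optimizer.step()
    scheduler.step()
```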

4.3. Comparison to the State-of-the-Art Methods

To demonstrate the superiority of our method, we compare our DSAD method to several advanced distillation methods, including KD [12], NST [38], AT [17], KDSVD [18], RKD [39], CCKD [40], MGD [14], PKD [15] and CrossKD [16]. For fair comparison, we re-evaluated all participating knowledge distillation methods on three excellent benchmark IRSTD networks for verification. The three benchmark models were DNANet [2], AMFU-Net [23], and DMFNet [24], which have demonstrated excellent performance in infrared small target detection.
The training settings and configurations of DNANet, AMFU-Net, and DMFNet remain consistent with those of the original papers. We conducted model training using the parameter settings mentioned above.
As shown in Figure 8, the losses of all these models exhibit a consistent decline as the number of epochs increases, demonstrating favorable convergence behavior and speed. Moreover, no anomalies such as overfitting or underfitting were observed during training, indicating that the models are well trained during the distillation process.

4.3.1. Quantitative Results

Quantitative results in Table 1, Table 2 and Table 3 show that the student networks’ detection performance differed between our DSAD method and the other distillation methods.
Table 1 shows the detection results of different knowledge distillation methods based on the DNANet framework. Using the same teacher network (4.70 M Params, 14.19 G FLOPs) and student network (0.30 M Params, 0.97 G FLOPs), the student network distilled with the DSAD method outperformed the other distillation methods, indicating that DSAD achieves more effective learning of small target features. Notably, our lightweight student network requires approximately 6.5% of the teacher network's computational resources while demonstrating comparable and even better performance on small target detection metrics. For example, the DSAD-based student network achieves an $IoU$ of 75.69% and a $P_d$ of 97.34%, comparable to the teacher network. The DSAD-based student network achieves 84.79%/75.69% in terms of $IoU$ and 98.10%/97.34% in terms of $P_d$ on the two benchmark datasets (NUDT-SIRST and NUAA-SIRST), respectively. Compared to the second-best distillation method (PKD), our DSAD method improves $IoU$ by 1.17%/1.04% and $P_d$ by 0.11%/1.14%. Compared to the baseline distillation method (KD), our DSAD method improves $IoU$ by 4.09%/1.31% and $P_d$ by 1.17%/2.28%. Due to the complex backgrounds in infrared images, multi-directional capture may introduce some unnecessary false alarms. Despite this, the false alarm rate of the DSAD-based student network still remains at an acceptable level.
Table 2 presents the detection results of different knowledge distillation methods based on the AMFU-Net framework. To simplify the experimental comparison, we select three excellent distillation methods, MGD, PKD, and CrossKD, for comparison. Using the same teacher network (1.88 M Params, 22.73 G FLOPs) and student network (0.12 M Params, 1.48 G FLOPs), the student network distilled with the DSAD method outperformed the other distillation methods. The DSAD-based student network achieves 80.79%/75.26% in terms of $IoU$ and 97.14%/95.06% in terms of $P_d$ on the two benchmark datasets (NUDT-SIRST and NUAA-SIRST), respectively. On these datasets, our DSAD method surpasses the second-best distillation method (PKD) and the baseline method (KD) on small target metrics (e.g., $IoU$, $P_d$), while the false alarm rate remains at an acceptable level.
Table 3 presents the detection results of different knowledge distillation methods based on the DMFNet framework. We again select three excellent distillation methods, MGD, PKD, and CrossKD, to simplify the comparison. Using the same teacher network (11.11 M Params, 40.21 G FLOPs) and student network (0.70 M Params, 2.62 G FLOPs), the student network distilled with the DSAD method outperformed the other distillation methods. The DSAD-based student network achieves 84.30%/74.16% in terms of $IoU$ and 98.20%/96.20% in terms of $P_d$ on the two benchmark datasets, respectively. On both datasets, our DSAD method surpasses the second-best distillation method (CrossKD) and the baseline method (KD) on small target metrics (e.g., $IoU$, $P_d$), and its false alarm rate is the best.

4.3.2. Qualitative Results

Qualitative results on the datasets are shown in Figure 9. We select several complex infrared scenarios to compare the detection results generated by different methods. Compared with other distillation methods, our DSAD method achieves better target localization and shape segmentation performance with a low false alarm rate under the guidance of the same teacher network. In contrast, models distilled by other methods easily generate numerous false alarms and missed detections in locally highlighted areas (e.g., img2 and img4). The model distilled by the DSAD method enables precise target contour segmentation and reliable detection of small targets in complex scenarios, achieving higher detection capability. Moreover, 3D visualization results of our method and the other methods are shown in Figure 10. Our method achieves better detection under the guidance of the same teacher network. The visualization results of img2, img3, and img6 demonstrate the robustness of our method in dense target scenarios.
Qualitative results show that our DSAD method enables the student network to achieve better feature knowledge learning under the same teacher and student frameworks, thereby achieving accurate small target localization and shape segmentation performance, while maintaining multi-scale sensitivity for small targets.
As shown in Figure 11, for small targets against complex cloud and land backgrounds, the results predicted by the distilled model exhibit both false alarms and missed detections. In the scene with thick clouds, targets and bright textures exhibit similar grayscale values; the contrast calculation fails to distinguish them and misidentifies the cloud texture as a target, causing the distillation loss to assign high weight to the cloud areas. In the mountainous vegetation scene, the small target response is weakened by shadows or overlapping vegetation. The contrast of small target areas is not prominent, and the Gaussian transformation relies on the feature magnitude without enhancing the weak target signal, thereby diminishing the ability to capture target features.

4.4. Ablation Study

In this subsection, we compare our DSAD method with several variants to investigate the potential benefits introduced by our strategies and design choices.

4.4.1. The Effectiveness of the DSA Mechanism in DSAD Method

In our DSAD method, the DSA mechanism is employed to capture the spatial weights from teacher features and integrate these weights into the distillation loss (PWMSE) for cross-network knowledge transfer. To validate the effectiveness of DSA mechanism in DSAD framework, we design three experimental groups, corresponding to different spatial weight settings. The first group introduces no spatial weights, setting coefficients to 1. The second group obtains spatial weights coefficients by traditional spatial attention (SA [19]). The third group uses our DSA to acquire spatial weight coefficients. The experimental results are summarized in Table 4.
As shown in Table 4, the third group ($M_{DSA}(F_T) \times \mathrm{SE}(T, S)$) achieves better detection performance ($IoU$ and $P_d$) than the others. This is because the DSA module adopts multi-directional convolution kernels to capture directional feature contrasts in the spatial information, and the generated spatial weights can effectively enhance small target areas while suppressing background areas. Compared with the first group, applying DSA in the distillation framework (i.e., the DSAD method) improves $IoU$ by 3.58%/1.81% and increases $P_d$ by 0.64%/1.90% on the two benchmark datasets. The DSA mechanism in the DSAD method effectively captures the multi-directional spatial gradients of small targets and distinguishes small targets from complex backgrounds by calculating the minimum directional contrast, thereby guiding the PWMSE distillation loss to assign high weights to target regions and suppress background interference. In the DSAD method, the DSA mechanism enhances the representation ability of small target features, which benefits detection performance.

4.4.2. The Effectiveness of the DSA Module for IRSTD Models

As the DSA is a form of spatial attention, it can be independently integrated into a model framework to enhance small target detection performance. To investigate this, we integrate the DSA module into the three IRSTD teacher and student networks and examine its impact on the small target detection performance of the IRSTD models.
Table 5 presents the performance of each model before and after the addition of the DSA module. The experimental results show that integrating DSA significantly enhances small target detection performance across all benchmark models without additional computational overhead. For instance, the DNANet-num16 model with DSA achieves an impressive 87.42% $IoU$ and 99.05% $P_d$, an increase of 1.29% in $IoU$ and 0.53% in $P_d$ over the original model. Similarly, AMFU-Net and DMFNet exhibit performance gains after integrating DSA. The DSA module addresses weak target gradient extraction and background suppression rather than relying on specific models. By taking the minimum contrast at each spatial location, the DSA module preserves real small targets with high multi-directional contrasts. Instead of being discarded, the small targets' spatial weights are further strengthened, enabling more precise target localization and feature transfer. These findings validate the effectiveness of the DSA module for IRSTD and highlight its inherent advantage.

4.4.3. The Effectiveness of DSA Module Collaborate DSAD Method

These above findings show that the DSA module and the DSAD distillation method effectively enhance detection performance. To further investigate their potential effects, we explored whether their collaborative application to student networks can yield remarkable results. Therefore, keeping the teacher network constant, we constructed two types of student networks: one has the integrated DSA module, and the other is the original student network. Both were trained using the DSAD method to acquire feature knowledge from teacher network.
Table 6 presents the detection results of the original student networks and the student networks integrated with the DSA module, both trained using the DSAD distillation method. The results show that student networks with the DSA module trained via DSAD distillation consistently outperform pure DSAD distillation, approaching the performance level of the teacher networks. Most notably, +DNA-num4 with the DSAD method achieves 75.88% $IoU$ and 98.10% $P_d$ on the NUAA-SIRST dataset, indicating detection performance comparable to or even surpassing that of the teacher network. Furthermore, +AMFU-num4 and +DMF-num4 also achieve remarkable results. The core components of the integrated DSAD distillation strategy exhibit a synergistic effect: the Gaussian transformation mitigates amplitude interference to ensure effective feature contrast, while DSA provides precise spatial attention to highlight target regions; PWMSE integrates both mechanisms to facilitate efficient knowledge transfer. Only the complete combination demonstrates superior performance. These findings validate the effectiveness of the designed DSA module and DSAD distillation method.

4.5. Inference Experiments on Edge Platforms

To investigate the inference efficiency of the lightweight student models on edge computing platforms, we conduct inference tests on mainstream edge hardware (NVIDIA AGX and HUAWEI Ascend-310B). We deploy the teacher and student models (DNA-num16 and DNA-num4, AMFU-num16 and AMFU-num4, DMF-num16 and DMF-num4) on these hardware platforms. Subsequently, we measure the average inference time on single infrared images from the aforementioned benchmark datasets. Figure 12 presents the inference results for 256 × 256-pixel infrared images on the edge devices, where the detected small targets are highlighted in red.
As shown in Table 7, the lightweight student models achieve efficient inference on edge computing platforms through reduced inference latency and higher acceleration rates. Owing to its computing power (32 TOPS) and TRT inference acceleration, the NVIDIA AGX runs these models slightly faster overall than the HUAWEI Ascend-310B (20 TOPS). On the NVIDIA AGX, the DNA-num4 student model has 0.30 M parameters and 0.97 G FLOPs, with an inference time of 17.27 ms for a single image, achieving a 62.39% improvement in inference efficiency (speedup = 2.66×); this is the best inference speed for a single 256 × 256 image among these models. The inference time of the AMFU-num4 student model is reduced to 27.98 ms, improving inference efficiency by 61.45% (speedup = 2.59×) compared to the AMFU-num16 teacher model. The DMF-num4 student model achieves an inference time of 33.25 ms and improves inference efficiency by 66.28% (speedup = 2.96×), reaching the best speedup ratio among these student models. On the HUAWEI Ascend-310B, the DNA-num4 student model has an inference time of 19.84 ms for a single image, achieving a 63.93% improvement in inference efficiency (speedup = 2.77×). Meanwhile, the DMF-num4 student model achieves an inference time of 40.90 ms, improving inference efficiency by 64.32% (speedup = 2.80×). These student models maintain comparable detection performance while achieving over 2× acceleration on edge devices for single 256 × 256 images, thereby providing technical insights for accelerating the sliced inference of large-scale infrared images in real scenarios.

5. Conclusions

In this paper, we propose a Multi-Directional Contrast Spatial Attention-driven Feature Distillation (DSAD) method to obtain lightweight and high-performance IRSTD models. Specifically, we first select intermediate feature maps from the teacher and student networks and normalize them to a standard Gaussian distribution. Then, a Multi-Directional Contrast Spatial Attention (DSA) module is designed to capture the spatial information of teacher features. Finally, the PWMSE distillation loss is formulated to guide the lightweight student network to focus on target areas. In this way, the response of the lightweight student network's layers to small targets is enhanced. Extensive experiments on public datasets demonstrate the effectiveness of our method, while deployment experiments on edge platforms show that the student networks achieve over 2× acceleration in inference latency.
Despite the interesting findings and promising results, this study still has some limitations waiting to be explored. Our method currently focuses on CNN-based networks applied to single-frame infrared small target detection, whereas extending it to Transformer-based IRSTD networks or multi-frame infrared small target detection deserves further exploration. In the DSAD method, the multi-directional contrast extraction has limitations in some complex scenes, and it could be optimized into a dynamic perception extraction pattern not restricted to fixed discrete directions. In future research, we intend to explore cross-architecture knowledge distillation methods (e.g., distillation from Transformer-based networks to CNN-based networks), multimodal knowledge distillation across tasks, and inference acceleration on edge computation clusters, thereby enhancing the efficient deployment and inference of large vision models for IRSTD on edge computing platforms. We hope this work will draw attention to research on knowledge distillation for IRSTD and lightweight model deployment in practical applications.

Author Contributions

Conceptualization, Y.L. and J.C.; investigation, Y.L., B.L., J.C., G.Z., S.D. and H.Z.; methodology, Y.L., G.Z. and J.C.; project administration, J.C.; software, Y.L. and B.L.; supervision, B.L. and J.C.; validation, Y.L. and B.L.; visualization, Y.L., B.L. and S.D.; writing—original draft, Y.L., B.L. and J.C.; writing—review and editing, Y.L., B.L., J.C., G.Z., S.D. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) (No. 62101567, No. 62401589, No. 62501618, No. 4250013163).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author and the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small-target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar]
  2. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef]
  3. Tang, W.; Dai, Q.; Hao, F. An efficient knowledge distillation-based detection method for infrared small targets. Remote Sens. 2024, 16, 3173. [Google Scholar]
  4. Li, Z.; Xu, P.; Chang, X.; Yang, L.; Zhang, Y.; Yao, L.; Chen, X. When object detection meets knowledge distillation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10555–10579. [Google Scholar] [CrossRef]
  5. Deng, L.; Li, G.; Han, S.; Shi, L.; Xie, Y. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 2020, 108, 485–532. [Google Scholar] [CrossRef]
  6. Zhang, M.; Wang, Y.; Guo, J.; Li, Y.; Gao, X.; Zhang, J. IRSAM: Advancing segment anything model for infrared small target detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 233–249. [Google Scholar]
  7. Deng, H.; Sun, X.; Liu, M.; Ye, C.; Zhou, X. Small infrared target detection based on weighted local difference measure. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4204–4214. [Google Scholar]
  8. Cheng, H.; Zhang, M.; Shi, J.Q. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Trans. Pattern Anal. Mach. Intell. 2024. early access. [Google Scholar]
  9. He, Y.; Xiao, L. Structured pruning for deep convolutional neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2900–2919. [Google Scholar] [CrossRef] [PubMed]
  10. Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 171–184. [Google Scholar]
  11. Yang, J.; Shen, X.; Xing, J.; Tian, X.; Li, H.; Deng, B.; Huang, J.; Hua, X.-S. Quantization networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  12. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. Comput. Sci. 2015, 14, 38–39. [Google Scholar]
  13. Li, B.; Wang, Y.; Wang, L.; Zhang, F.; Liu, T.; Lin, Z.; An, W.; Guo, Y. Monte Carlo linear clustering with single-point supervision is enough for infrared small target detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 1009–1019. [Google Scholar]
  14. Yang, Z.; Li, Z.; Shao, M.; Shi, D.; Yuan, Z.; Yuan, C. Masked generative distillation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 53–69. [Google Scholar]
  15. Cao, W.; Zhang, Y.; Gao, J.; Cheng, A.; Cheng, K.; Cheng, J. PKD: General distillation framework for object detectors via Pearson correlation coefficient. Adv. Neural Inf. Process. Syst. 2023, 35, 15394–15406. [Google Scholar]
  16. Wang, J.; Chen, Y.; Zheng, Z.; Li, X.; Cheng, M.-M.; Hou, Q. CrossKD: Cross-head knowledge distillation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16520–16530. [Google Scholar]
  17. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  18. Xu, G.; Liu, Z.; Li, X.; Loy, C.C. Knowledge distillation meets self-supervision. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; pp. 588–604. [Google Scholar]
  19. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  20. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226. [Google Scholar] [CrossRef]
  21. Chen, C.L.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581. [Google Scholar] [CrossRef]
  22. Zhu, H.; Liu, S.; Deng, L.; Li, Y.; Xiao, F. Infrared small target detection via low-rank tensor completion with top-hat regularization. IEEE Trans. Geosci. Remote Sens. 2020, 58, 1004–1016. [Google Scholar] [CrossRef]
  23. Chung, W.Y.; Lee, I.H.; Park, C.G. Lightweight infrared small target detection network using full-scale skip connection U-Net. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7000705. [Google Scholar]
  24. Guo, T.; Zhou, B.; Luo, F.; Zhang, L.; Gao, X. DMFNet: Dual-encoder multistage feature fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5614214. [Google Scholar]
  25. Li, B.; Wang, L.; Wang, Y.; Wu, T.; Lin, Z.; Li, M.; An, W.; Guo, Y. Mixed-precision network quantization for infrared small target segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5000812. [Google Scholar]
  26. Li, B.; Ying, X.; Li, R.; Liu, Y.; Shi, Y.; Li, M.; Zhang, X.; Hu, M.; Wu, C.; Zhang, Y.; et al. ICPR 2024 competition on resource-limited infrared small target detection challenge: Methods and results. In Proceedings of the International Conference on Pattern Recognition, Kolkata, India, 1–5 December 2024; pp. 62–77. [Google Scholar]
  27. Cai, Z.; Xing, S.; Quan, S.; Su, X.; Wang, J. A power-distribution joint optimization arrangement for multi-point source jamming system. Results Eng. 2025, 27, 106856. [Google Scholar]
  28. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  29. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2022, 32, 364–376. [Google Scholar]
  30. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  31. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261. [Google Scholar] [CrossRef]
  32. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. UNet 3+: A full-scale connected UNet for medical image segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  33. Zhang, M.; Yue, K.; Zhang, J.; Li, Y.; Gao, X. Exploring feature compensation and cross-level correlation for infrared small target detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 1857–1865. [Google Scholar]
  34. Zhang, M.; Yang, H.; Guo, J.; Li, Y.; Gao, X.; Zhang, J. IRPruneDet: Efficient infrared small target detection via wavelet structure-regularized soft channel pruning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; pp. 7224–7232. [Google Scholar]
  35. Zhang, M.; Yue, K.; Li, B.; Guo, J.; Li, Y.; Gao, X. Single-frame infrared small target detection via Gaussian curvature inspired network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005013. [Google Scholar]
  36. Wang, L.; Yoon, K.-J. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3048–3068. [Google Scholar] [CrossRef]
  37. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Bengio, Y. FitNets: Hints for Thin Deep Nets. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  38. Huang, Z.; Wang, N. Like what you like: Knowledge distill via neuron selectivity transfer. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  39. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3967–3976. [Google Scholar]
  40. Peng, B.; Jin, X.; Liu, J.; Li, D.; Wu, Y.; Liu, Y.; Zhou, S.; Zhang, Z. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5007–5016. [Google Scholar]
  41. Zhang, W.; Feng, W.; Li, M.; Lyu, S.; Xu, T.-B. A saliency-transformer combined knowledge distillation guided network for infrared small target detection. In Proceedings of the International Conference on Signal and Information Processing, Networking and Computers, Beijing, China, 15–17 December 2022; pp. 88–95. [Google Scholar]
  42. Xue, J.; Li, J.; Han, Y.; Wang, Z.; Deng, C.; Xu, T. Feature-based knowledge distillation for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6005305. [Google Scholar] [CrossRef]
  43. Dai, Y.; Li, X.; Zhou, F.; Qian, Y.; Chen, Y.; Yang, J. One-stage cascade refinement networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000917. [Google Scholar] [CrossRef]
  44. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  45. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 8026–8037. [Google Scholar]
Figure 1. Computational cost (e.g., FLOPs) and detection performance (e.g., Pd) of our proposed DSAD method on several benchmark IRSTD models. Models generated by the DSAD method achieve low computational cost while maintaining comparable detection performance.
Figure 2. Illustration of the Multi-Directional Contrast Spatial Attention-driven Feature Distillation (DSAD) method. First, we select and align intermediate features from teacher and student networks. Then, we apply the standard Gaussian transformation to feature maps to eliminate magnitude effects. After that, spatial weights derived from DSA based on teacher features and feature discrepancies are utilized to construct the PWMSE distillation loss. Finally, the overall training loss for the student network is formulated to guide optimization.
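To read the Figure 2 pipeline as a single objective, the following is a minimal sketch of how the student training loss is typically assembled in such feature-distillation setups; the concrete detection loss term and the balancing weight λ are assumptions for illustration, not values stated in this section:

$$\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{det}}\big(\hat{y}_{S}, y\big) + \lambda \sum_{l} \mathrm{PWMSE}\big(F_{T}^{(l)}, F_{S}^{(l)}\big),$$

where $F_{T}^{(l)}$ and $F_{S}^{(l)}$ denote the aligned, Gaussian-transformed teacher and student features at stage $l$, and PWMSE is the perceptual weighted distillation loss described above.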
Figure 3. The two mainstream categories of existing knowledge distillation. (a) Feature distillation uses intermediate-layer features as supervision, minimizing the feature differences between the teacher and student networks. (b) Logit distillation directly uses the teacher's output predictions as supervision to guide the student learning process.
Figure 4. Energy distribution maps of the intermediate feature maps produced by the teacher and student networks for the same input image.
Figure 5. Illustration of the Multi-Directional Contrast Spatial Attention (DSA) module. First, (a) spatial aggregation is performed. Then, (b) multi-directional convolution kernels are used for feature extraction. After that, (c) directional feature contrasts are calculated. Finally, (d) the minimum contrast value is selected and activated to extract the spatial information of the input features.
Figure 6. The composition of multi-directional convolution kernels. (a) shows eight discrete directional convolution kernels, and (b) shows the direction distribution of multi-directional convolution kernels.
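To make the four steps in Figure 5 and the kernel layout in Figure 6 concrete, the following is a minimal, parameter-free PyTorch sketch of a DSA-style spatial attention map. The channel-mean aggregation, the single-pixel 3×3 directional kernels, the center-minus-direction contrast, and the sigmoid activation are assumptions made for illustration; they follow the captions' steps (a)–(d) but are not guaranteed to match the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def dsa_weights(feat: torch.Tensor) -> torch.Tensor:
    """Sketch of a Multi-Directional Contrast Spatial Attention (DSA) map.
    feat: (B, C, H, W) teacher feature map -> (B, 1, H, W) spatial weights."""
    # (a) spatial aggregation: collapse channels into a single response map
    agg = feat.mean(dim=1, keepdim=True)                      # (B, 1, H, W)

    # (b) eight fixed 3x3 directional kernels, one per neighbor of the center pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    kernels = torch.zeros(8, 1, 3, 3, device=feat.device, dtype=feat.dtype)
    for k, (dy, dx) in enumerate(offsets):
        kernels[k, 0, 1 + dy, 1 + dx] = 1.0                   # select one direction each
    directional = F.conv2d(agg, kernels, padding=1)           # (B, 8, H, W)

    # (c) directional contrasts: center response minus each directional response
    contrast = agg - directional                               # (B, 8, H, W)

    # (d) keep the minimum contrast over the eight directions, then activate
    min_contrast, _ = contrast.min(dim=1, keepdim=True)        # (B, 1, H, W)
    return torch.sigmoid(min_contrast)
```

Because the directional kernels are fixed rather than learned, a module of this form adds no learnable parameters, which is consistent with the near-identical parameter counts reported in Table 5.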
Figure 7. Heatmaps and energy-probability distributions of teacher–student feature maps before and after standard Gaussian normalization.
Figure 8. Convergence of the model-distillation training process.
Figure 9. Qualitative results generated by different knowledge distillation methods. Correctly detected targets, false alarms, and missed detections are highlighted by red, yellow, and green dashed circles, respectively. Our DSAD generates outputs with precise target localization and shape segmentation at a low false alarm rate.
Figure 10. Corresponding 3D visualization results of different knowledge distillation methods.
Figure 11. Some failure results generated by our DSAD method. (a) Qualitative failure results generated by our method. (b) Corresponding 3D visualization of the failure results.
Figure 12. Inference results of the teacher and student models on the NVIDIA AGX and HUAWEI Ascend-310B platforms; detected targets are overlaid in red.
Table 1. Performance of different knowledge distillation methods based on the DNANet framework on the NUAA-SIRST and NUDT-SIRST datasets. All student networks have the same structure. The best results are in red and the second-best results are in blue.
Method | Params | FLOPs | NUDT-SIRST: IoU / Pd / Fa (×10⁻⁵) | NUAA-SIRST: IoU / Pd / Fa (×10⁻⁵)
DNA-num16 | 4.70 M | 14.19 G | 86.13% / 98.52% / 0.250 | 75.86% / 96.58% / 1.38
DNA-num4 | 0.30 M | 0.97 G | 79.68% / 97.35% / 1.13 | 74.29% / 95.82% / 2.25
KD [12] | 0.30 M | 0.97 G | 80.70% / 96.93% / 1.25 | 74.38% / 95.06% / 1.31
NST [38] | 0.30 M | 0.97 G | 83.15% / 97.46% / 0.859 | 75.13% / 95.44% / 1.38
AT [17] | 0.30 M | 0.97 G | 81.39% / 97.04% / 1.20 | 74.96% / 95.82% / 1.49
KDSVD [18] | 0.30 M | 0.97 G | 82.74% / 97.20% / 0.657 | 74.51% / 96.19% / 1.80
RKD [39] | 0.30 M | 0.97 G | 81.78% / 97.35% / 1.04 | 75.23% / 95.06% / 1.54
CCKD [40] | 0.30 M | 0.97 G | 81.73% / 96.19% / 0.703 | 73.53% / 94.68% / 2.37
MGD [14] | 0.30 M | 0.97 G | 83.04% / 97.99% / 1.35 | 73.34% / 95.44% / 3.02
PKD [15] | 0.30 M | 0.97 G | 83.62% / 97.99% / 1.27 | 74.65% / 96.20% / 1.25
CrossKD [16] | 0.30 M | 0.97 G | 83.36% / 97.79% / 1.13 | 74.66% / 95.81% / 1.57
DSAD (ours) | 0.30 M | 0.97 G | 84.79% / 98.10% / 0.882 | 75.69% / 97.34% / 1.65
DNA-num16 represents the teacher model, where num16 indicates that the number of channels in this model is set to [16, 32, 64, 128, 256]; DNA-num4 represents the student model, where num4 indicates that the number of channels in this model is set to [4, 8, 16, 32, 64].
Table 2. Performance of different knowledge distillation methods based on the AMFU-Net framework on NUAA-SIRST and NUDT-SIRST datasets. All student networks have the same structure. The best results are in red and the second-best results are in blue.
Method | Params | FLOPs | NUDT-SIRST: IoU / Pd / Fa (×10⁻⁵) | NUAA-SIRST: IoU / Pd / Fa (×10⁻⁵)
AMFU-num16 | 1.88 M | 22.73 G | 83.21% / 97.57% / 0.855 | 75.62% / 95.43% / 1.65
AMFU-num4 | 0.12 M | 1.48 G | 78.98% / 95.45% / 1.15 | 73.13% / 93.92% / 1.23
KD [12] | 0.12 M | 1.48 G | 79.15% / 95.66% / 1.15 | 73.66% / 93.92% / 1.79
MGD [14] | 0.12 M | 1.48 G | 79.29% / 96.40% / 1.59 | 74.55% / 94.29% / 1.64
PKD [15] | 0.12 M | 1.48 G | 79.89% / 96.72% / 1.88 | 74.86% / 94.68% / 1.28
CrossKD [16] | 0.12 M | 1.48 G | 79.36% / 96.19% / 0.850 | 74.15% / 94.67% / 1.33
DSAD (ours) | 0.12 M | 1.48 G | 80.79% / 97.14% / 1.73 | 75.26% / 95.06% / 0.920
AMFU-num16 represents the teacher model, where num16 indicates that the number of channels in this model is set to [16, 32, 64, 128, 256]; AMFU-num4 represents the student model, where num4 indicates that the number of channels in this model is set to [4, 8, 16, 32, 64].
Table 3. Performance of different knowledge distillation methods based on the DMFNet framework on NUAA-SIRST and NUDT-SIRST datasets. All student networks have the same structure. The best results are in red and the second-best results are in blue.
Method | Params | FLOPs | NUDT-SIRST: IoU / Pd / Fa (×10⁻⁵) | NUAA-SIRST: IoU / Pd / Fa (×10⁻⁵)
DMF-num16 | 11.11 M | 40.21 G | 86.43% / 98.41% / 0.398 | 75.03% / 96.96% / 1.40
DMF-num4 | 0.70 M | 2.62 G | 80.94% / 96.93% / 1.29 | 72.29% / 95.82% / 1.20
KD [12] | 0.70 M | 2.62 G | 81.50% / 97.67% / 1.39 | 72.57% / 95.82% / 1.38
MGD [14] | 0.70 M | 2.62 G | 82.74% / 97.99% / 1.06 | 73.14% / 96.19% / 2.61
PKD [15] | 0.70 M | 2.62 G | 83.40% / 97.88% / 0.597 | 73.58% / 96.09% / 1.51
CrossKD [16] | 0.70 M | 2.62 G | 83.70% / 98.09% / 3.86 | 73.33% / 96.20% / 1.38
DSAD (ours) | 0.70 M | 2.62 G | 84.30% / 98.20% / 0.489 | 74.16% / 96.20% / 1.33
DMF-num16 represents the teacher model, where num16 indicates that the number of channels in this model is set to [16, 32, 64, 128, 256]; DMF-num4 represents the student model, where num4 indicates that the number of channels in this model is set to [4, 8, 16, 32, 64].
Table 4. Performance of different perceptual weighted distillation losses based on the DNANet framework on NUAA-SIRST and NUDT-SIRST datasets. All student networks have the same structure. The best results are in red.
Method | Params | FLOPs | NUDT-SIRST: IoU / Pd / Fa (×10⁻⁵) | NUAA-SIRST: IoU / Pd / Fa (×10⁻⁵)
DNA-num16 | 4.70 M | 14.19 G | 86.13% / 98.52% / 0.250 | 75.86% / 96.58% / 1.38
DNA-num4 | 0.30 M | 0.97 G | 79.68% / 97.35% / 1.13 | 74.29% / 95.82% / 2.25
1 × SE(T,S) | 0.30 M | 0.97 G | 82.21% / 97.46% / 0.480 | 73.88% / 95.44% / 2.27
M_SA(F_T) × SE(T,S) | 0.30 M | 0.97 G | 83.31% / 97.35% / 1.03 | 74.74% / 96.57% / 1.42
M_DSA(F_T) × SE(T,S) | 0.30 M | 0.97 G | 84.79% / 98.10% / 0.882 | 75.69% / 97.34% / 1.65
SE(T,S) denotes the element-wise squared difference between the Gaussian-transformed teacher and student feature maps. The three experimental configurations follow the settings described above.
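For clarity, the following is a minimal PyTorch sketch of how the three Table 4 configurations can be assembled. The per-feature-map normalization granularity, the function names, and the epsilon constant are assumptions for illustration; the spatial weight w stands for M_SA(F_T) or M_DSA(F_T) (e.g., the dsa_weights sketch shown after Figure 6), and an all-ones w recovers the unweighted 1 × SE(T,S) baseline.

```python
import torch

def gaussian_transform(f: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Standard Gaussian transformation of a (B, C, H, W) feature map;
    # per-map statistics are an assumption made for this sketch.
    mu = f.mean(dim=(2, 3), keepdim=True)
    sigma = f.std(dim=(2, 3), keepdim=True)
    return (f - mu) / (sigma + eps)

def pwmse_loss(f_t: torch.Tensor, f_s: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Perceptual Weighted MSE sketch.
    f_t, f_s: aligned teacher/student features, (B, C, H, W).
    w: (B, 1, H, W) spatial weight map, e.g. M_DSA(F_T);
       w = torch.ones_like(f_t[:, :1]) reproduces the 1 x SE(T,S) row."""
    se = (gaussian_transform(f_t) - gaussian_transform(f_s)) ** 2   # SE(T, S)
    return (w * se).mean()                                          # weighted mean
```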
Table 5. Performance comparison of different IRSTD models with and without the DSA module on the NUDT-SIRST dataset. The best results are in red. Using the DSA module provides a positive gain in small target detection performance.
Method | None: Params / FLOPs / IoU / Pd / Fa (×10⁻⁵) | +DSA: Params / FLOPs / IoU / Pd / Fa (×10⁻⁵)
DNA-num16 (teacher) | 4.70 M / 14.19 G / 86.13% / 98.52% / 0.250 | 4.71 M / 14.05 G / 87.42% / 99.05% / 0.322
DNA-num4 (student) | 0.30 M / 0.97 G / 79.68% / 97.35% / 1.13 | 0.31 M / 0.92 G / 81.34% / 98.10% / 0.951
AMFU-num16 (teacher) | 1.88 M / 22.73 G / 83.21% / 97.57% / 0.855 | 1.88 M / 22.69 G / 85.87% / 98.09% / 0.609
AMFU-num4 (student) | 0.12 M / 1.48 G / 78.98% / 95.45% / 1.15 | 0.12 M / 1.47 G / 79.33% / 96.30% / 1.18
DMF-num16 (teacher) | 11.11 M / 40.21 G / 86.43% / 98.41% / 0.398 | 11.12 M / 40.21 G / 86.84% / 98.73% / 0.625
DMF-num4 (student) | 0.70 M / 2.62 G / 80.94% / 96.93% / 1.29 | 0.71 M / 2.62 G / 82.21% / 97.99% / 0.936
Table 6. Performance comparison of different student networks using the DSA module and the DSAD method on the NUDT-SIRST and NUAA-SIRST datasets. The best results are in red.
Method | Params | FLOPs | NUDT-SIRST: IoU / Pd / Fa (×10⁻⁵) | NUAA-SIRST: IoU / Pd / Fa (×10⁻⁵)
DNA-num16 | 4.70 M | 14.19 G | 86.13% / 98.52% / 0.250 | 75.86% / 96.58% / 1.38
DNA-num4 (DSAD) | 0.30 M | 0.97 G | 84.79% / 98.10% / 0.882 | 75.69% / 97.34% / 1.65
+DNA-num4 (DSAD) | 0.31 M | 0.92 G | 84.80% / 98.20% / 0.575 | 75.88% / 98.10% / 1.99
AMFU-num16 | 1.88 M | 22.73 G | 83.21% / 97.57% / 0.855 | 75.62% / 95.43% / 1.65
AMFU-num4 (DSAD) | 0.12 M | 1.48 G | 80.79% / 97.14% / 1.73 | 75.26% / 95.06% / 0.920
+AMFU-num4 (DSAD) | 0.12 M | 1.47 G | 81.78% / 97.35% / 1.15 | 75.38% / 95.31% / 1.10
DMF-num16 | 11.11 M | 40.21 G | 86.43% / 98.41% / 0.398 | 75.03% / 96.96% / 1.40
DMF-num4 (DSAD) | 0.70 M | 2.62 G | 84.30% / 98.20% / 0.489 | 74.16% / 96.20% / 1.33
+DMF-num4 (DSAD) | 0.71 M | 2.62 G | 86.02% / 98.73% / 0.301 | 75.00% / 96.96% / 1.48
The '+' indicates that the initial student network is integrated with the DSA module. For the three benchmark IRSTD networks, the DSAD distillation method is applied to both types of student networks, and detection performance is evaluated accordingly.
Table 7. Inference performance of the teacher and student models on different platforms. Inf_time refers to the inference time for a single image; Decline refers to the reduction in inference time from teacher to student.
Edge Platform | Metric | DNA-num16 | DNA-num4 | AMFU-num16 | AMFU-num4 | DMF-num16 | DMF-num4
– | FLOPs | 14.19 G | 0.97 G | 22.73 G | 1.48 G | 40.21 G | 2.62 G
– | Params | 4.70 M | 0.30 M | 1.88 M | 0.12 M | 11.11 M | 0.70 M
NVIDIA AGX (32 TOPS) | Inf_time | 45.84 ms | 17.24 ms | 72.59 ms | 27.98 ms | 98.61 ms | 33.25 ms
NVIDIA AGX (32 TOPS) | Decline | – | 28.60 ms | – | 44.61 ms | – | 65.36 ms
NVIDIA AGX (32 TOPS) | Speedup | – | 2.66× | – | 2.59× | – | 2.96×
HUAWEI Ascend-310B (20 TOPS) | Inf_time | 55.00 ms | 19.84 ms | 82.40 ms | 35.24 ms | 114.64 ms | 40.90 ms
HUAWEI Ascend-310B (20 TOPS) | Decline | – | 35.16 ms | – | 47.16 ms | – | 73.74 ms
HUAWEI Ascend-310B (20 TOPS) | Speedup | – | 2.77× | – | 2.34× | – | 2.80×
num16 and num4 represent the teacher and student networks mentioned above, respectively.
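As a worked example of how the Decline and Speedup rows follow from the Inf_time entries, take the DNANet pair on the NVIDIA AGX: Decline = 45.84 ms − 17.24 ms = 28.60 ms, and Speedup = 45.84 ms / 17.24 ms ≈ 2.66×; the remaining entries are computed in the same way for each teacher-student pair.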
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
