Article

PSHNet: Hybrid Supervision and Feature Enhancement for Accurate Infrared Small-Target Detection

Weicong Chen, Chenghong Zhang and Yuan Liu
1 Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7629; https://doi.org/10.3390/app15147629
Submission received: 23 June 2025 / Revised: 4 July 2025 / Accepted: 5 July 2025 / Published: 8 July 2025

Abstract

Detecting small targets in infrared imagery remains highly challenging due to sub-pixel target sizes, low signal-to-noise ratios, and complex background clutter. This paper proposes PSHNet, a hybrid deep-learning framework that combines dense spatial heatmap supervision with geometry-aware regression for accurate infrared small-target detection. The network generates position–scale heatmaps to guide coarse localization, which are further refined through sub-pixel offset and size regression. A Complete IoU (CIoU) loss is introduced as a geometric regularization term to improve alignment between predicted and ground-truth bounding boxes. To better preserve fine spatial details essential for identifying small thermal signatures, an Enhanced Low-level Feature Module (ELFM) is incorporated using multi-scale dilated convolutions and channel attention. Experiments on the NUDT-SIRST and IRSTD-1k datasets demonstrate that PSHNet outperforms existing methods in IoU, detection probability, and false alarm rate, with consistent IoU gains and robust performance under low-SNR conditions.

1. Introduction

Infrared small-target detection is pivotal for applications ranging from military reconnaissance and precision-guided munitions to maritime search-and-rescue and environmental monitoring. In these scenarios, targets often occupy only a handful of pixels against cluttered, low-contrast backgrounds, making them extremely difficult to discern from noise. A missed detection can translate to mission failure or loss of life, while false alarms waste critical resources. Moreover, infrared imaging’s all-weather capability and effectiveness in low-light conditions underscore its strategic value, yet these same advantages amplify the challenge of reliably identifying minute thermal signatures under varying environmental conditions.
Traditional infrared small-target detectors—whether based on handcrafted spatial filters or on early adaptations of general object-detection CNNs—struggle when targets occupy only a few pixels amid heavy clutter or low contrast. Hand-designed filters require careful parameter tuning and tend to produce excessive false alarms under complex backgrounds, while naïve CNN adaptations lack the dense supervision and fine-grained feature preservation needed for sub-pixel localization. Even recent specialized deep models, despite improvements in overall recall and precision, still struggle on two fronts: they lack dense pixel-level guidance for sub-pixel localization, and they lose the fine low-level details on which small thermal signatures depend.
To bridge this gap, we introduce PSHNet, a hybrid deep-learning framework that unifies dense spatial supervision with precise geometric refinement.
(1)
PSHNet addresses the lack of pixel-level guidance under sparse box-level supervision by generating a position heatmap, where each target center is encoded as a Gaussian distribution to supply dense localization cues. A dedicated offset branch then regresses sub-pixel shifts to refine the coarse heatmap peaks, while a size branch directly predicts each target’s width and height, yielding precise position–scale estimates.
(2)
To counteract the loss of fine details caused by deep-layer downsampling, we incorporate an Enhanced Low-level Feature Module (ELFM) immediately after the first encoder block. ELFM applies multi-scale dilated convolutions to capture contextual information across varying receptive fields and employs channel-wise attention to emphasize channels most relevant to small targets, thus preserving and amplifying critical shallow-layer spatial features.
(3)
To compensate for the limited global context of heatmap-plus-regression alone, we introduce a CIoU regularization term on the reconstructed bounding boxes. This auxiliary loss enforces consistency in location, scale, and aspect ratio without overriding the dense supervision signals, ensuring that detections remain both spatially sensitive and geometrically aligned.
Through the tight integration of dense heatmap cues, geometry-aware regression, and enhanced low-level feature preservation, PSHNet reliably achieves high recall and precision in detecting minute infrared targets under complex backgrounds and low-signal conditions. To further clarify the motivation behind our design, a detailed discussion of existing methods and their limitations is presented in the next section.

2. Related Work

2.1. Traditional Infrared Small-Target Detection Methods

Traditional infrared small-target detection methods typically assume that targets appear as sparse, high-intensity points on locally smooth backgrounds. These approaches are non-learning-based, operate on single-frame imagery, and rely heavily on handcrafted filters or statistical priors to enhance local contrast and suppress background clutter.
Early techniques include morphological operators such as the Top-Hat transform [1], which isolates bright points by subtracting the morphologically opened image from the original. Variants like the Multiscale and Ring Top-Hat filters improve scale adaptability but remain sensitive to parameter selection and struggle with non-uniform clutter. The Max-Mean and Max-Median filters [2] estimate local background statistics and identify salient targets based on intensity contrast, yet they require fine-tuned window sizes and may fail in highly textured scenes. Local contrast-based detectors, such as the Absolute Directional Mean Difference (ADMD) and Adaptive Gaussian Difference (AGD), enhance target visibility by emphasizing directional contrast or Gaussian-filtered differences [3,4].
Other spatial-domain methods include Gaussian Curvature Filtering (GCF) [5], which preserves blob-like structures while smoothing out background patterns, and Local Steering Kernel (LSK) reconstruction [6], which fits a locally adaptive model to suppress structured clutter. Frequency-domain approaches, like the Phase Spectrum of the Fourier Transform (PFT) [7], highlight spatial anomalies through phase-only information, achieving good background suppression but limited spatial precision. Matched filtering methods maximize the signal-to-noise ratio based on the assumed point-spread characteristics of the target [8], yet they are template-dependent and vulnerable to shape variation. Overall, while traditional methods offer low computational cost and require no training data, their performance degrades significantly in the presence of complex clutter, dim targets, or highly dynamic backgrounds. Moreover, they lack adaptability to varying environments and often produce high false alarm rates due to the limited discriminative capacity of handcrafted features.

2.2. Deep Learning-Based Infrared Small-Target Detection Methods

In recent years, deep learning has significantly advanced infrared small-target detection by leveraging data-driven feature extraction and end-to-end optimization. Compared with traditional handcrafted methods, deep models offer greater adaptability to cluttered scenes and complex environmental variations. Existing approaches can be broadly categorized based on their architectural foundations and learning strategies.
(1)
Detection-Based Frameworks (Box-Level Supervision):
Several studies have adapted mainstream object detectors such as Faster R-CNN and YOLO to infrared data. For example, IAA-Net employs a two-stage pipeline with a ResNet-18-based region proposal module and a cross-channel attention mechanism to refine target localization [9]. YOLO-FR enhances YOLOv5 with a feature reassembly module to improve small target saliency [10], while YOLOSR-IST combines super-resolution preprocessing with a lightweight YOLO detection head to increase recall under low SNR conditions [11]. Anchor-free variants, such as ISTD-CenterNet, utilize keypoint-based regression to estimate target centers and sizes without predefined anchor boxes, offering improved adaptability for irregular or sub-pixel targets [12].
(2)
Segmentation-Style Networks (Pixel-Level Supervision):
Many recent approaches adopt encoder–decoder architectures to produce pixel-level predictions of small target regions. For instance, MDvsFA introduces a dual-branch adversarial framework that balances missed detections (MD) and false alarms (FA) through class-specific discriminators [13]. ALCNet incorporates a local contrast prior into a U-shaped network and applies bottom-up attention to retain fine spatial features [14]. DNANet features a densely nested attention structure that enhances multi-scale fusion and suppresses redundant noise, outperforming many anchor-based methods [15]. Other segmentation-style methods include MSHNet and ISNet, which introduce hierarchical skip-connections or shape-guided supervision to recover target geometry and location [16,17].
(3)
Transformer-Based Models:
Recently, Transformer architectures have been applied to enhance global feature modeling in infrared detection tasks. FTC-Net integrates CNN and Transformer branches to combine local spatial precision with long-range contextual cues, improving robustness in cluttered backgrounds [18]. TCI-Former further introduces a thermal-conduction-inspired module to encode spatial propagation dynamics of thermal signals, aiding weak target discrimination [19].
(4)
Specialized Training Mechanisms:
Adversarial training is exemplified by MDvsFA [13], while self-supervised learning (SSL) has been explored in SSL-YOLO, where instance discrimination pretraining improves target generalization with limited labeled infrared data [20]. Point-supervised methods, like PSMNet, reduce annotation effort by using only center points as labels and employing uncertainty modeling to guide detection [21].
(5)
Multimodal and Data-Augmented Techniques:
Methods such as ESM-YOLO fuse visible and infrared inputs via dual-branch encoders and attention-based feature fusion, enabling improved cross-modality saliency [22]. Synthetic infrared datasets and targeted data augmentation strategies (e.g., simulated injection, clutter modeling) have also been proposed to enhance generalization under limited data regimes [23].
In summary, deep learning has enabled substantial progress in infrared small-target detection, although, as detailed next, several fundamental limitations persist.

2.3. Limitations of Existing Methods

Despite significant advances in infrared small-target detection, current techniques—both traditional and deep learning-based—remain challenged by several fundamental limitations:
Low SNR and Extremely Small Target Size: Infrared targets are often just a few pixels and exhibit very low signal-to-noise ratios, making them vulnerable to suppression during both handcrafted filtering and deep feature downsampling.
Supervision and Localization Gaps: Deep learning methods typically rely on box-level labels, providing weak supervision that fails to localize sub-pixel targets accurately. This leads to spatial misalignment and missed detections, as no mechanism enforces dense or pixel-precise guidance.
Loss of Fine-Scale Texture: In CNN architectures, repeated pooling and convolution reduce the visibility of subtle spatial details. Shallow layers that retain this information are either underutilized or merged too late, resulting in lost texture essential for small target discrimination.
Lack of Geometric Constraints: Most detection frameworks regress bounding boxes independently without explicit geometric consistency (size, aspect ratio). This absence of structure-aware regularization increases localization drift, especially under clutter and noise.
Together, these limitations underscore the need for a hybrid architecture that integrates dense spatial supervision, fine-detail preservation, and global geometric regularization. PSHNet is designed to meet exactly this requirement.

3. Proposed Method

3.1. Overall Architecture

Infrared small target detection poses significant challenges due to the low signal-to-noise ratio, the presence of complex backgrounds, and the extremely small size of the targets, which often span only a few pixels. To address these difficulties, we propose a novel hybrid framework named PSHNet (Position–Scale Heatmap Network), which combines the dense localization capability of segmentation-style heatmaps with the geometric precision of structure-aware box refinement.
The architecture of PSHNet is illustrated in Figure 1. It follows a U-Net-style encoder–decoder structure designed to preserve high-resolution spatial features throughout the network. The encoder is composed of successive convolutional layers that extract hierarchical features, while the decoder progressively restores spatial resolution by upsampling and fusing features from corresponding encoder layers. To enhance the model’s sensitivity to small, subtle features—which are typically lost during deep-layer downsampling—we introduce an Enhanced Low-level Feature Module (ELFM) after the first encoding stage. ELFM strengthens shallow features via multi-scale dilated convolutions and channel-wise attention, improving the model’s ability to capture fine-grained patterns necessary for small target detection.
As shown in Figure 1, at the end of the decoder, PSHNet integrates a multi-branch localization head, consisting of three parallel output modules:
  • A heatmap head, which produces a dense spatial probability map indicating likely target center locations using a Gaussian encoding.
  • An offset head, which refines the coarse grid-aligned center predictions by regressing sub-pixel shifts.
  • A size head, which predicts the width and height (w, h) of each target.
These outputs are combined to form structured predictions representing the location and scale of each detected target. The heatmap ensures dense spatial coverage, while the offset and size predictions recover precise position–scale representations.
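To make the head design concrete, the following is a minimal PyTorch sketch of such a three-branch output module. It is an illustrative reconstruction of the description above, not the authors' released code; the channel widths (in_ch, mid_ch) and the two-layer branch structure are assumptions.

```python
import torch
import torch.nn as nn

class MultiBranchHead(nn.Module):
    """Parallel heatmap / offset / size heads over shared decoder features.

    A sketch of the head described in Section 3.1; layer widths are
    illustrative assumptions, not values reported in the paper.
    """
    def __init__(self, in_ch: int = 64, mid_ch: int = 64):
        super().__init__()
        def branch(out_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, out_ch, 1),
            )
        self.heatmap = branch(1)   # dense center-probability map
        self.offset = branch(2)    # sub-pixel (dx, dy) refinement
        self.size = branch(2)      # per-target (w, h) estimate

    def forward(self, feat: torch.Tensor) -> dict:
        return {
            "heatmap": torch.sigmoid(self.heatmap(feat)),  # probabilities in [0, 1]
            "offset": self.offset(feat),
            "size": self.size(feat),
        }
```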
Unlike traditional detection frameworks that treat CIoU-based box regression as the primary loss, PSHNet treats CIoU as an auxiliary geometric regularization term. After the predicted centers are refined using heatmap peaks and offsets, the CIoU loss is computed between the predicted and ground-truth bounding boxes. This encourages alignment in location and shape without compromising the dense spatial supervision provided by the heatmap structure.
PSHNet thus unifies the spatial flexibility of segmentation-style prediction with the geometric rigor of box-level regularization. It is particularly suited for infrared small target scenarios, where precise localization must be achieved under conditions of minimal signal and high ambiguity. The complete network design supports accurate, interpretable, and efficient small object detection under weak visibility conditions.

3.2. Enhanced Low-Level Feature Module (ELFM)

Infrared small targets typically occupy only a few pixels and lack rich semantic context. Therefore, high-resolution spatial cues—such as sharpness, edge response, and local intensity contrast—are essential for early detection. These cues are predominantly present in the shallow layers of the encoder. However, traditional convolutional pipelines often downsample these early features too aggressively, leading to a loss of spatial resolution and detail that can hinder localization accuracy, especially for small or low-contrast objects. To address this issue, we introduce an Enhanced Low-level Feature Module (ELFM) within the PSHNet architecture, which is specifically designed to preserve and refine shallow-level information to support dense localization in the subsequent decoder stages.
The ELFM is embedded after the first encoder block and operates on high-resolution shallow features. It consists of the following key components:
Multi-scale dilated convolutions: To capture features at varying receptive fields without downsampling, ELFM applies a series of 3 × 3 convolutions with different dilation rates (e.g., 1, 2, and 4). These layers enable the network to gather spatial context from multiple scales while preserving the resolution of the feature map. The choice of dilation rates (1, 2, and 4) strikes a balance between local and mid-range spatial context aggregation, which is particularly important for small objects that lack strong semantic features. Larger dilation rates were experimentally found to produce fragmented responses or ignore localized contrast.
Channel-wise attention: Following the multi-scale fusion, a squeeze-and-excitation (SE) block is used to recalibrate feature responses across channels. This mechanism learns to emphasize channels that are more relevant for identifying target-related structures, while suppressing background noise.
Feature aggregation: The outputs of all branches are concatenated and refined using a 1 × 1 convolution to reduce dimensionality and unify the enhanced features. This produces the final output tensor, which is then passed to the decoder via skip connections.
A schematic of ELFM is shown in Figure 2. This module operates entirely in the spatially high-resolution domain, ensuring that localization-critical features are preserved and enriched. Formally, let F denote the shallow feature map; the output of the dilated convolution branches is calculated by the following:
$F_k = \sigma\big(\mathrm{Conv}_{3\times 3}^{\,d=k}(F)\big), \quad k \in \{1, 2, 4\}$
where $\sigma(\cdot)$ is the ReLU activation and $d$ denotes the dilation rate.
The concatenated feature is as follows:
$F_{\mathrm{concat}} = [\,F_1,\; F_2,\; F_4\,]$
The channel attention weight vector $\alpha \in \mathbb{R}^{1 \times 1 \times C}$ is computed via squeeze-and-excitation:
$\alpha = \sigma_s\big(W_2\,\mathrm{ReLU}(W_1\,\mathrm{GAP}(F_{\mathrm{concat}}))\big)$
where $\sigma_s(\cdot)$ is the sigmoid function, $\mathrm{GAP}(\cdot)$ denotes global average pooling, and $W_1$, $W_2$ are the weights of the SE bottleneck. The final enhanced feature output is as follows:
$F_{\mathrm{ELFM}} = \mathrm{Conv}_{1\times 1}\big(F_{\mathrm{concat}} \otimes \alpha\big)$
where $\otimes$ denotes channel-wise multiplication and $C$ is the number of concatenated channels.
ELFM plays a foundational role in PSHNet by ensuring that the heatmap, offset, and size branches receive feature inputs with strong spatial discriminability. The combination of multi-scale dilation and attention improves the robustness of the model in detecting small, low-contrast targets under noisy or cluttered conditions.
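The following PyTorch sketch shows one way to realize ELFM as described above: three dilated 3 × 3 branches, SE attention over the concatenated channels, and 1 × 1 fusion. It is a reconstruction under stated assumptions; the reduction ratio of 16 is the common SE default, not a value reported in the paper.

```python
import torch
import torch.nn as nn

class ELFM(nn.Module):
    """Enhanced Low-level Feature Module: multi-scale dilated convolutions,
    squeeze-and-excitation channel attention, and 1x1 feature fusion.
    A sketch of Section 3.2; exact widths are assumptions.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Three 3x3 branches with dilation rates 1, 2, 4; padding=d keeps resolution.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True))
            for d in (1, 2, 4)
        ])
        concat_ch = 3 * channels
        # Squeeze-and-excitation over the concatenated channels.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # GAP -> (B, C, 1, 1)
            nn.Conv2d(concat_ch, concat_ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(concat_ch // reduction, concat_ch, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(concat_ch, channels, 1)       # dimensionality reduction

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([b(f) for b in self.branches], dim=1)  # F_concat
        return self.fuse(cat * self.se(cat))                   # F_ELFM
```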

3.3. Position–Scale Heatmap Supervision with CIoU Regularization

Infrared small target detection presents a dual challenge: objects are often minute and visually ambiguous, demanding both dense spatial sensitivity and precise localization. Traditional anchor-based box regression offers geometric accuracy but suffers from sparse supervision and rigid templates. In contrast, segmentation-style heatmaps deliver dense supervision but lack center specificity and shape consistency.
To address this, PSHNet employs a hybrid strategy that combines the spatial flexibility of heatmap prediction with geometric refinement through offset and size regression, followed by CIoU-based regularization. This enables the model to first generate coarse center candidates via heatmaps, refine their positions and scales using learned regressors, and finally enforce alignment with ground-truth boxes using CIoU loss.
The structure of the multi-branch output head at the end of the decoder is shown in Figure 3. It consists of three parallel branches: a heatmap head for dense spatial localization, an offset head for offset refinement, and a size head for target scale estimation. These branches operate on the final decoded feature maps and share the same intermediate features to promote representation consistency.
The Heatmap Head outputs a single-channel spatial probability map indicating the likelihood of target centers. Ground-truth annotations are encoded as 2D Gaussians centered at each object location. To emphasize confident activations and suppress noisy regions, we adopt a modified focal loss:
$L_{\mathrm{heat}} = -\dfrac{1}{N} \displaystyle\sum_{i,j} \begin{cases} (1 - \hat{Y}_{i,j})^{\alpha}\,\log(\hat{Y}_{i,j}), & \text{if } Y_{i,j} = 1 \\ (1 - Y_{i,j})^{\beta}\,(\hat{Y}_{i,j})^{\alpha}\,\log(1 - \hat{Y}_{i,j}), & \text{otherwise} \end{cases}$
where $\hat{Y}_{i,j}$ and $Y_{i,j}$ denote the predicted and ground-truth heatmap values at pixel $(i, j)$, $N$ is the number of targets, and $\alpha$, $\beta$ are modulating coefficients.
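For illustration, a minimal sketch of the Gaussian target encoding and this modified focal loss follows. The coefficients α = 2 and β = 4 are the values customary in CenterNet-style heatmap losses and are an assumption here, as are the helper names.

```python
import torch

def gaussian_heatmap(shape, centers, sigma=2.0):
    """Render the ground-truth heatmap: a 2D Gaussian at each target center.
    shape: (H, W); centers: list of (x, y) in heatmap coordinates."""
    H, W = shape
    ys = torch.arange(H).view(H, 1).float()
    xs = torch.arange(W).view(1, W).float()
    Y = torch.zeros(H, W)
    for cx, cy in centers:
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        Y = torch.maximum(Y, g)  # keep the strongest response per pixel
    return Y

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Modified focal loss over the dense heatmap (CenterNet-style)."""
    pos = gt.eq(1).float()          # exact centers are positives
    neg = 1.0 - pos                 # Gaussian tails are down-weighted negatives
    pos_term = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_term + neg_term).sum() / num_pos
```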
Due to spatial downsampling, heatmap peaks may not align perfectly with object centers. To correct this, the offset head predicts a 2D shift vector at each candidate location, enabling sub-pixel refinement. The loss is defined by
$L_{\mathrm{offset}} = \dfrac{1}{N} \sum_{i=1}^{N} \big\| \hat{o}_i - o_i^{gt} \big\|_1$
where $\hat{o}_i$ and $o_i^{gt}$ are the predicted and ground-truth offsets, and $N$ is the number of positive locations.
The Size Head branch estimates the width and height of each target based on the refined center. Unlike anchor-based approaches, PSHNet directly regresses size without predefined templates. The corresponding L1 loss is as follows:
$L_{\mathrm{size}} = \dfrac{1}{N} \sum_{i=1}^{N} \big\| \hat{s}_i - s_i^{gt} \big\|_1$
where $\hat{s}_i$ and $s_i^{gt}$ represent the predicted and true sizes, respectively.
All three branches operate on shared decoder features and are jointly optimized by the following total supervision loss:
$L_{\mathrm{sup}} = L_{\mathrm{heat}} + \lambda_{\mathrm{off}}\, L_{\mathrm{offset}} + \lambda_{\mathrm{size}}\, L_{\mathrm{size}}$
with default weights $\lambda_{\mathrm{off}} = \lambda_{\mathrm{size}} = 1.0$.
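A compact sketch of this combined supervision loss is given below, reusing heatmap_focal_loss from the sketch above. The tensor layout (offset and size maps of shape (B, 2, H, W) with a float mask marking positive locations) is an assumed convention, not the paper's specification.

```python
import torch.nn.functional as F

def supervision_loss(pred, target, pos_mask, lam_off=1.0, lam_size=1.0):
    """L_sup = focal heatmap loss + masked L1 losses for offsets and sizes.
    pos_mask: float tensor of shape (B, 1, H, W), 1 at target centers."""
    l_heat = heatmap_focal_loss(pred["heatmap"], target["heatmap"])
    n = pos_mask.sum().clamp(min=1.0)
    l_off = (F.l1_loss(pred["offset"], target["offset"], reduction="none")
             * pos_mask).sum() / n
    l_size = (F.l1_loss(pred["size"], target["size"], reduction="none")
              * pos_mask).sum() / n
    return l_heat + lam_off * l_off + lam_size * l_size
```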
To further enforce box-level consistency, we introduce CIoU loss as a global refinement mechanism. Once bounding boxes are reconstructed from heatmap peaks, offset vectors, and predicted sizes, we compute the following:
$L_{\mathrm{CIoU}}^{(i)} = 1 - \mathrm{IoU}\big(\hat{b}_i, b_i^{gt}\big) + \dfrac{\rho^2\big(\hat{c}_i, c_i^{gt}\big)}{c^2} + \alpha v$
where:
  • $\rho^2(\hat{c}_i, c_i^{gt})$ is the squared distance between the predicted and ground-truth box centers;
  • $c^2$ is the squared diagonal length of the smallest box enclosing $\hat{b}_i$ and $b_i^{gt}$;
  • $v$ measures aspect-ratio inconsistency;
  • $\alpha$ is a dynamically adjusted weight defined by the original CIoU formulation.
The CIoU loss thus simultaneously penalizes localization, scale, and aspect-ratio discrepancies, ensuring comprehensive geometric alignment. Unlike conventional detectors, where CIoU is treated as the primary localization objective, PSHNet adopts it solely as an auxiliary term:
$L_{\mathrm{total}} = L_{\mathrm{sup}} + \lambda_{\mathrm{CIoU}}\, \dfrac{1}{N} \sum_{i=1}^{N} L_{\mathrm{CIoU}}\big(\hat{b}_i, b_i^{gt}\big),$
with $\lambda_{\mathrm{CIoU}} = 1.0$ by default.
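The CIoU term follows the original formulation of Zheng et al.; the sketch below is a generic implementation for corner-format (x1, y1, x2, y2) boxes, offered as an illustration rather than the authors' code.

```python
import math
import torch

def ciou_loss(pred_boxes, gt_boxes, eps=1e-7):
    """Complete IoU loss for (x1, y1, x2, y2) boxes of shape (N, 4)."""
    px1, py1, px2, py2 = pred_boxes.unbind(-1)
    gx1, gy1, gx2, gy2 = gt_boxes.unbind(-1)

    # Intersection over union.
    iw = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0)
    ih = (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # Normalized squared center distance rho^2 / c^2.
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)   # enclosing-box width
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)   # enclosing-box height
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and its dynamic weight alpha (detached, per CIoU).
    v = (4 / math.pi ** 2) * (torch.atan((gx2 - gx1) / (gy2 - gy1 + eps))
                              - torch.atan((px2 - px1) / (py2 - py1 + eps))) ** 2
    with torch.no_grad():
        a = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + a * v).mean()
```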
The formulation first guides the network to localize targets with high spatial sensitivity through dense heat-map predictions and then refines their geometric structure via focused box-level feedback. This loss design supports PSHNet’s core objective of achieving high recall and precision under the low-signal conditions characteristic of infrared small-target scenes.

4. Experiment

4.1. Experimental Setup and Evaluation Metrics

To objectively evaluate the proposed PSHNet framework, comprehensive experiments were conducted on two widely used public benchmark datasets: NUDT-SIRST and IRSTD-1k. The NUDT-SIRST dataset contains a total of 1327 infrared images, each sized 256 × 256 pixels, which we divided into 995 training and 332 testing samples using a 3:1 ratio. Similarly, the IRSTD-1k dataset comprises 1001 infrared images and was split into 800 training and 201 testing samples following a standard 4:1 ratio, consistent with prior works. To standardize input dimensions while preserving object geometry, all images were resized to 256 × 256 using aspect-ratio-preserving interpolation and zero-padding. Data augmentation—including flipping, rotation, and brightness jittering—was subsequently applied to the fixed-size images, ensuring geometric consistency across inputs and labels. This sequence follows common practice in infrared small-target detection tasks.
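The aspect-ratio-preserving resize with zero-padding can be implemented as below. This is a minimal sketch of the stated preprocessing; the right/bottom padding placement is an assumption, and any consistent placement works provided the labels are transformed with the same scale and padding.

```python
import torch.nn.functional as F

def resize_keep_aspect(img, out_size=256):
    """Resize the longer side to out_size, then zero-pad to a square,
    preserving target geometry. img: float tensor of shape (C, H, W)."""
    _, h, w = img.shape
    scale = out_size / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    img = F.interpolate(img.unsqueeze(0), size=(nh, nw),
                        mode="bilinear", align_corners=False).squeeze(0)
    pad_r, pad_b = out_size - nw, out_size - nh
    # Apply the same (scale, pad) transform to box/center annotations.
    return F.pad(img, (0, pad_r, 0, pad_b))  # zero-pad right and bottom
```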
The PSHNet architecture was implemented in PyTorch v1.13 and trained from scratch on an NVIDIA RTX 3090 GPU. We employed the AdaGrad optimizer with an initial learning rate of 0.05 and applied cosine annealing over 300 training epochs. A batch size of four was used, and gradient clipping was activated whenever the gradient norm exceeded a threshold of 10 to ensure training stability. The multi-branch output heads were trained using a combination of modified focal loss for the heatmap branch and L1 losses for the offset and size branches. Furthermore, a Complete IoU (CIoU) loss was incorporated as an auxiliary geometric regularization term applied to the reconstructed bounding boxes, promoting spatial alignment and shape consistency without overriding the dense heatmap supervision.
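The training configuration described above maps directly onto standard PyTorch components, as in the following skeleton. PSHNet(), train_loader, and total_loss are placeholders for the network, the data pipeline, and the loss of Section 3.3, not symbols from the paper.

```python
import torch

model = PSHNet()  # hypothetical constructor for the network in Figure 1
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    for images, targets in train_loader:          # batch size 4, per Section 4.1
        optimizer.zero_grad()
        loss = total_loss(model(images), targets)  # L_sup + lambda_CIoU * L_CIoU
        loss.backward()
        # Clip whenever the gradient norm exceeds 10, for training stability.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
        optimizer.step()
    scheduler.step()
```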
To assess detection performance, we adopted three widely accepted metrics: Intersection over Union (IoU) for pixel-level localization accuracy, Probability of Detection (Pd) for recall sensitivity, and False Alarm Rate (Fa) to quantify robustness against background noise and spurious responses.
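Definitions of Pd and Fa vary slightly across the literature; the sketch below shows one common convention (pixel-level IoU, center-distance matching for Pd, and false-alarm pixels normalized by total pixels). The 3-pixel matching radius is an assumption, not the paper's stated protocol.

```python
import numpy as np

def pixel_iou(pred_mask, gt_mask):
    """Pixel-level IoU between binary prediction and ground-truth masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / max(union, 1)

def pd_fa(pred_centers, gt_centers, false_alarm_pixels, total_pixels, dist_thr=3.0):
    """Pd = matched targets / total targets; Fa = false-alarm pixels / all pixels.
    A target counts as detected when a predicted center lies within
    dist_thr pixels of its ground-truth center (one common convention)."""
    matched = 0
    for gx, gy in gt_centers:
        if any((px - gx) ** 2 + (py - gy) ** 2 <= dist_thr ** 2
               for px, py in pred_centers):
            matched += 1
    pd = matched / max(len(gt_centers), 1)
    fa = false_alarm_pixels / total_pixels
    return pd, fa
```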

4.2. Comparison with State-of-the-Art Methods

To further validate the effectiveness of PSHNet, we conducted comparative experiments against a range of representative infrared small-target detection algorithms, including both traditional approaches and state-of-the-art deep learning-based models. The traditional baselines include Top-Hat, a morphological filter that enhances bright regions, and IPI, a low-rank decomposition method that separates targets from structured backgrounds. Among learning-based methods, we compared with MDvsFA, which leverages adversarial learning to balance false alarms and missed detections, ALCNet, which applies local contrast attention to enhance small targets, and DNANet, which incorporates nested attention for deep feature fusion. In contrast, our proposed PSHNet unifies dense heatmap-based localization with geometric regularization via CIoU, reinforced by shallow feature enhancement through the ELFM module.
All models were retrained on the NUDT-SIRST and IRSTD-1k datasets using identical preprocessing procedures and data splits to ensure a fair and reproducible comparison. Table 1 summarizes the results across the three evaluation metrics. On IRSTD-1k, PSHNet achieves the highest IoU (66.47%) and lowest Fa (6.83 × 10−6), while maintaining a high Pd of 98.01%. On NUDT-SIRST, our method again leads with an IoU of 83.02% and a Pd of 98.94%, outperforming all baselines with a notably reduced false alarm rate of 14.08 × 10−6. These results demonstrate that PSHNet effectively captures weak, small-scale targets while suppressing background interference, validating the benefit of our hybrid design.
Although the Pd score on NUDT-SIRST reaches 98.94%, we emphasize that the official dataset split was strictly followed, and no image or annotation overlaps exist between training and testing. In addition, to assess performance robustness, we computed 95% confidence intervals via bootstrapping (1000 iterations) over the test set. The resulting intervals are: NUDT-SIRST—Pd: 98.94% (±0.68%), IoU: 83.02% (±0.91%); IRSTD-1k—Pd: 98.01% (±0.74%), IoU: 66.47% (±1.08%).
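A percentile-bootstrap interval of the kind reported here can be computed over per-image metric scores as follows; the per-image resampling granularity and the fixed seed are assumptions for reproducibility of the sketch.

```python
import numpy as np

def bootstrap_ci(per_image_scores, iters=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval over per-image metric scores,
    matching the 1000-iteration protocol described above."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_image_scores)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(iters)])
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100,
                                   (1 + level) / 2 * 100])
    return scores.mean(), (lo, hi)
```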
To further explore the balance between sensitivity and specificity, Figure 4 presents target-level receiver operating characteristic (ROC) curves on the NUDT-SIRST dataset. PSHNet consistently maintains superior detection probability across a wide range of false alarm thresholds, showing particular advantage in the low-Fa regime critical for practical surveillance applications.
To assess real-world deployability, we also report GFLOPs and inference time on both high-end and embedded platforms. For reference deployment, we converted PSHNet to TensorRT-FP16 and benchmarked it on an NVIDIA Jetson Orin Nano (6-core CPU @ 1.5 GHz, 512-core GPU). With an input resolution of 256 × 256 and a batch size of 1, the 5.7 M-parameter network runs at 28.7 ms per image on the Orin Nano and at 13.4 ms on an RTX 3090; CPU-only inference on an Intel i7-12700K takes roughly 380 ms per image. While PSHNet was not designed with lightweight deployment as a primary goal, these results indicate that near-real-time deployment is feasible, and future work may further optimize the architecture for size and speed if required. Table 2 summarizes model parameters, computational cost, and inference latency across platforms.
To assess robustness across varying object scales, a scale-wise analysis was conducted on the IRSTD-1k dataset by grouping targets into small (area ≤ 10 pixels), medium (10 < area ≤ 40), and large (area > 40). Table 3 reports detection results for each group. PSHNet outperforms DNANet and ALCNet across all scales, particularly in the small-target regime, confirming the benefits of our ELFM module and size-aware localization branches.
To complement the quantitative analysis, Figure 5 presents qualitative comparisons of detection results across multiple test cases from the IRSTD-1k dataset. PSHNet consistently achieves cleaner suppression of background clutter and more accurate localization of faint or small-scale targets. Red circles indicate true positives, green circles indicate missed detections, and yellow circles denote false alarms.
These results underscore PSHNet’s adaptability, making it suitable for real-world deployment in environments with diverse target characteristics and complex infrared scenes.

4.3. Internal Module Evaluation (Ablation Study)

To thoroughly analyze and validate the contribution of individual components within the proposed PSHNet framework, a detailed ablation study was conducted using the NUDT-SIRST dataset under consistent training and evaluation conditions. Specifically, the experiments aimed to quantify the impact of the following key components: the Enhanced Low-level Feature Module (ELFM), Offset and Size prediction heads, and the CIoU geometric regularization. The baseline model was defined as a standard encoder–decoder network with only the heatmap prediction branch supervised by the modified focal loss.
We systematically constructed five configurations to assess the incremental contributions of each component:
  • Baseline: U-Net encoder–decoder architecture with only the heatmap branch and modified focal loss supervision.
  • Baseline + ELFM: Baseline augmented with the Enhanced Low-level Feature Module to strengthen spatial details.
  • Baseline + Offset/Size Heads: Baseline enhanced with offset and size regression heads supervised by L1 loss.
  • Baseline + CIoU Regularization: Baseline incorporating CIoU loss to regularize geometric consistency of predictions.
  • Full Model (Proposed PSHNet): Complete integration of ELFM, offset/size heads, and CIoU regularization.
The results of the ablation experiments are summarized quantitatively in Table 4, which reports the gains in Intersection over Union (IoU) and Probability of Detection (Pd) and the reduction in False Alarm Rate (Fa) as each component is added incrementally.
As indicated in Table 4, integrating ELFM alone significantly improved IoU from 78.16% to 80.43% and notably reduced the false alarm rate from 27.14 × 10−6 to 20.67 × 10−6, demonstrating the importance of preserving low-level spatial detail for infrared small-target detection. The addition of offset and size heads further refined center localization and bounding box regression accuracy, improving IoU to 80.91% and reducing Fa to 18.53 × 10−6. Furthermore, introducing the CIoU geometric regularization independently enhanced localization precision, boosting IoU to 81.47% and Pd to 98.02%, simultaneously achieving a lower Fa of 17.21 × 10−6.
Ultimately, the complete PSHNet configuration, which simultaneously incorporates ELFM, offset/size regression, and CIoU regularization, achieved the best overall performance, reaching an IoU of 83.02%, a Pd of 98.94%, and the lowest false alarm rate of 14.08 × 10−6. These results underscore a strong synergy among the designed components, confirming their collective importance in addressing the unique challenges of infrared small-target detection tasks.
To assess the impact and robustness of data augmentation, we conducted additional experiments in which key augmentation strategies were adjusted or removed. When all augmentation (flipping, rotation, and brightness jittering) was disabled, PSHNet's performance on NUDT-SIRST dropped from 83.02% to 78.31% in IoU and from 98.94% to 95.52% in Pd, highlighting the importance of augmentation in preventing overfitting and improving generalization. When the rotation range alone was widened from ±30° to ±60°, Pd dropped only slightly to 98.31% and IoU to 82.47%, indicating that the model remains robust under stronger geometric perturbations.

5. Conclusions and Discussion

This paper introduces PSHNet, a hybrid deep-learning framework designed for accurate and robust infrared small-target detection. By integrating dense heatmap supervision, sub-pixel offset and size regression, and CIoU-based geometric regularization, PSHNet effectively addresses key challenges such as weak thermal signals, small target sizes, and complex background clutter. The inclusion of the Enhanced Low-level Feature Module (ELFM) further enhances the model’s sensitivity to fine spatial details. Extensive experiments on two benchmark datasets confirm the effectiveness and robustness of the proposed method. Inference tests on embedded platforms further demonstrate the model’s applicability for near real-time deployment.
While PSHNet demonstrates strong performance under diverse infrared imaging conditions, several limitations remain. First, the model contains 5.7 million parameters and is trained with only a few thousand labeled samples, which may risk overfitting in data-scarce scenarios. Although early stopping was not triggered in our experiments, and no severe overfitting was observed, future work may explore parameter-efficient variants or pruning strategies to improve training-data efficiency. Second, detection relies on thresholding the predicted heatmap responses; although this is standard practice in regression-based detectors, confidence calibration is implicit. Practical deployment can adjust this threshold based on ROC analysis or application-specific Pd–Fa trade-offs.
Moreover, failure cases occasionally arise under extreme low-SNR conditions, partial occlusions, or thermal clutter with structured backgrounds. These scenarios reveal the limitations of purely spatial representations and suggest that additional contextual or temporal information may be beneficial. While the current framework is trained under fully supervised settings, generalization may be further limited in environments with sparse or noisy annotations.
To address these limitations, future work will explore strategies to enhance adaptability and scalability. Potential directions include the following:
  • Semi-supervised or weakly supervised learning, to reduce dependency on dense labels;
  • Model compression, such as replacing deeper convolutional blocks with ghost feature modules [24], to support deployment on low-power platforms;
  • Lightweight attention mechanisms [25], to better capture subtle thermal features with minimal computational cost;
  • Temporal modeling or multimodal fusion, such as visible-infrared integration, to improve robustness under motion blur or complex scenes.
Expanding evaluation to real-world video sequences and additional sensor modalities will also be essential steps toward practical and reliable deployment.

Author Contributions

Conceptualization, W.C.; methodology, W.C.; software, W.C.; formal analysis, C.Z.; writing—original draft preparation, W.C.; writing—review and editing, Y.L.; supervision, C.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the results of this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bai, X.; Zhou, F.; Jin, T. Enhancement of dim small target through modified top-hat transformation under the condition of heavy clutter. Signal Process. 2010, 90, 1643–1654. [Google Scholar] [CrossRef]
  2. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. Proc. SPIE 1999, 3809, 74–83. [Google Scholar]
  3. Zhou, D.; Wang, X. Robust infrared small target detection using a novel four-leaf model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1462–1469. [Google Scholar] [CrossRef]
  4. Guan, X.; Peng, Z.; Huang, S.; Chen, Y. Gaussian scale-space enhanced local contrast measure for small infrared target detection. IEEE Geosci. Remote Sens. Lett. 2019, 17, 327–331. [Google Scholar] [CrossRef]
  5. Zhang, J.; Liu, P.; Xie, J.; Li, M. Infrared small target detection based on Gaussian curvature filtering and partial sum of singular values. Proc. SPIE 2022, 12506, 928–934. [Google Scholar]
  6. Kou, R.; Wang, C.; Peng, Z.; Zhao, Z.; Chen, Y.; Han, J.; Huang, F.; Yu, Y.; Fu, Q. Infrared small target segmentation networks: A survey. Pattern Recognit. 2023, 143, 109788. [Google Scholar] [CrossRef]
  7. Hou, X.; Zhang, L. Saliency detection: A spectral residual approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
  8. Yan, Z.; Xin, Y.; Su, R.; Liang, X.; Wang, H. Multi-Scale Infrared Small Target Detection Method via Precise Feature Matching and Scale Selection Strategy. IEEE Access 2020, 8, 48660–48672. [Google Scholar] [CrossRef]
  9. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior attention-aware network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  10. Ciocarlan, A.; Le Hegarat-Mascle, S.; Lefebvre, S.; Woiselle, A.; Barbanson, C. A Contrario Paradigm for Yolo-Based Infrared Small Target Detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5630–5634. [Google Scholar]
  11. Hao, X.; Luo, S.; Chen, M.; He, C.; Wang, T.; Wu, H. Infrared small target detection with super-resolution and YOLO. Opt. Laser Technol. 2024, 177, 111221. [Google Scholar] [CrossRef]
  12. Li, N.; Huang, S.; Wei, D. Infrared Small Target Detection Algorithm Based on ISTD-CenterNet. Comput. Mater. Contin. 2023, 77, 3511–3531. [Google Scholar] [CrossRef]
  13. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8509–8518. [Google Scholar]
  14. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  15. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef] [PubMed]
  16. Liu, Q.; Liu, R.; Zheng, B.; Wang, H.; Fu, Y. Infrared small target detection with scale and location sensitivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17490–17499. [Google Scholar]
  17. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
  18. Qi, M.; Liu, L.; Zhuang, S.; Liu, Y.; Li, K.; Yang, Y.; Li, X. FTC-Net: Fusion of transformer and CNN features for infrared small target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8613–8623. [Google Scholar] [CrossRef]
  19. Chen, T.; Tan, Z.; Chu, Q.; Wu, Y.; Liu, B.; Yu, N. Tci-former: Thermal conduction-inspired transformer for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 1201–1209. [Google Scholar]
  20. Ciocarlan, A.; Hégarat-Mascle, S.L.; Lefebvre, S.; Woiselle, A. Robust infrared small target detection using self-supervised and a contrario paradigms. arXiv 2024, arXiv:2410.07437. [Google Scholar]
  21. Ni, R.; Wu, J.; Qiu, Z.; Chen, L.; Luo, C.; Huang, F.; Liu, Q.; Wang, B.; Li, Y.; Li, Y. Point-to-Point Regression: Accurate Infrared Small Target Detection With Single-Point Annotation. IEEE Trans. Geosci. Remote. Sens. 2025, 63, 1–19. [Google Scholar] [CrossRef]
  22. Zhang, Q.; Qiu, L.; Zhou, L.; An, J. ESM-YOLO: Enhanced Small Target Detection Based on Visible and Infrared Multi-modal Fusion. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 1454–1469. [Google Scholar]
  23. Liu, Q.; Li, X.; Yuan, D.; Yang, C.; Chang, X.; He, Z. LSOTB-TIR: A large-scale high-diversity thermal infrared single object tracking benchmark. IEEE Trans. Neural Networks Learn. Syst. 2023, 35, 9844–9857. [Google Scholar] [CrossRef] [PubMed]
  24. Hayat, M.; Aramvith, S.; Bhattacharjee, S.; Ahmad, N. Attention ghostunet++: Enhanced segmentation of adipose tissue and liver in ct images. arXiv 2025, arXiv:2504.11491. [Google Scholar]
  25. Hayat, M.; Gupta, M.; Suanpang, P.; Nanthaamornphong, A. Super-Resolution Methods for Endoscopic Imaging: A Review. In Proceedings of the 2024 12th International Conference on Internet of Everything, Microwave, Embedded, Communication and Networks (IEMECON), Jaipur, India, 24–26 October 2024; pp. 1–6. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed PSHNet framework.
Figure 2. Customized SE Block in PSHNet-ELFM.
Figure 3. Structure of the multi-branch prediction heads and bounding box generation.
Figure 4. Target-level ROC curves on the NUDT-SIRST dataset. Our method consistently achieves a better balance between Pd and Fa across a wide range of confidence thresholds.
Figure 5. Visual comparison of detection results under varying conditions. Each row corresponds to a test case from the IRSTD-1k dataset. From left to right: Input Image, Ground Truth, Top-Hat, IPI, MDvsFA, ALCNet, DNANet, and PSHNet (Ours). Red circles indicate true targets, green circles indicate missed detections, and yellow circles indicate false alarms.
Table 1. Detection performance comparison on the NUDT-SIRST and IRSTD-1k datasets. ↑ indicates higher values are better; ↓ indicates lower values are better.

Method  | Dataset    | IoU (%) ↑ | Pd (%) ↑ | Fa (×10⁻⁶) ↓
--------|------------|-----------|----------|-------------
Top-Hat | IRSTD-1k   | 12.56     | 68.94    | 188.78
IPI     | IRSTD-1k   | 26.01     | 70.39    | 36.69
MDvsFA  | IRSTD-1k   | 52.88     | 86.90    | 31.32
ALCNet  | IRSTD-1k   | 48.02     | 93.30    | 29.87
DNANet  | IRSTD-1k   | 63.01     | 97.70    | 9.72
Ours    | IRSTD-1k   | 66.47     | 98.01    | 6.83
Top-Hat | NUDT-SIRST | 22.34     | 70.54    | 95.37
IPI     | NUDT-SIRST | 18.67     | 67.87    | 45.78
MDvsFA  | NUDT-SIRST | 52.84     | 82.34    | 50.55
ALCNet  | NUDT-SIRST | 78.45     | 96.34    | 35.99
DNANet  | NUDT-SIRST | 82.17     | 98.93    | 23.65
Ours    | NUDT-SIRST | 83.02     | 98.94    | 14.08
Table 2. Comparison of model complexity and inference latency.

Method | Parameters (M) | GFLOPs | Latency (RTX 3090, ms) | Latency (Orin Nano, ms)
-------|----------------|--------|------------------------|------------------------
ALCNet | 1.50           | 4.2    | 6.63                   | 17.8
MDvsFA | 3.13           | 7.5    | 9.66                   | 22.4
DNANet | 4.69           | 10.8   | 24.05                  | 35.6
Ours   | 5.70           | 12.3   | 13.40                  | 28.7
Table 3. Scale-wise detection performance comparison on IRSTD-1k (target area in pixels). ↑ indicates higher values are better; ↓ indicates lower values are better.

Method | Target Size       | IoU (%) ↑ | Pd (%) ↑ | Fa (×10⁻⁶) ↓
-------|-------------------|-----------|----------|-------------
ALCNet | (0, 10] (small)   | 47.26     | 91.27    | 22.53
DNANet | (0, 10] (small)   | 49.51     | 95.24    | 20.68
Ours   | (0, 10] (small)   | 52.08     | 96.12    | 17.42
ALCNet | (10, 40] (medium) | 63.14     | 91.85    | 13.25
DNANet | (10, 40] (medium) | 64.79     | 91.85    | 8.03
Ours   | (10, 40] (medium) | 66.37     | 93.16    | 6.51
ALCNet | (40, ∞) (large)   | 78.46     | 93.94    | 6.78
DNANet | (40, ∞) (large)   | 79.20     | 96.97    | 11.02
Ours   | (40, ∞) (large)   | 80.91     | 97.34    | 9.06
Table 4. Quantitative results of the ablation study on the NUDT-SIRST dataset. ↑ indicates higher values are better; ↓ indicates lower values are better.

Configuration                  | IoU (%) ↑ | Pd (%) ↑ | Fa (×10⁻⁶) ↓
-------------------------------|-----------|----------|-------------
Baseline                       | 78.16     | 96.85    | 27.14
Baseline + ELFM                | 80.43     | 97.32    | 20.67
Baseline + Offset/Size Heads   | 80.91     | 97.84    | 18.53
Baseline + CIoU Regularization | 81.47     | 98.02    | 17.21
Full Model (Proposed PSHNet)   | 83.02     | 98.94    | 14.08
