Article

HDR-IRSTD: Detection-Driven HDR Infrared Image Enhancement and Small Target Detection Based on HDR Infrared Image Enhancement

1 College of Information and Communication, National University of Defense Technology (NUDT), Wuhan 430010, China
2 College of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Automation 2025, 6(4), 86; https://doi.org/10.3390/automation6040086
Submission received: 23 September 2025 / Revised: 16 November 2025 / Accepted: 24 November 2025 / Published: 2 December 2025

Abstract

Infrared small target detection has become a research hotspot in recent years. Due to the small target size and low contrast with the background, it remains a highly challenging task. Existing infrared small target detection algorithms are generally implemented on 8-bit low dynamic range (LDR) images, whereas raw infrared sensing images typically possess a 14–16 bit high dynamic range (HDR). Conventional HDR image enhancement methods do not consider the subsequent detection task. As a result, the enhanced LDR images often suffer from overexposure, increased noise levels with higher contrast, and target distortion or loss. Consequently, discriminative features in HDR images that are beneficial for detection are not effectively exploited, which further increases the difficulty of small target detection. To extract target features under these conditions, existing detection algorithms usually rely on large parameter models, leading to an unsatisfactory trade-off between efficiency and accuracy. To address these issues, this paper proposes a novel infrared small target detection framework based on HDR image enhancement (HDR-IRSTD). Specifically, a multi-branch feature extraction and fusion mapping subnetwork (MFEF-Net) is designed to achieve the mapping from HDR to LDR. This subnetwork effectively enhances small targets and suppresses noise while preserving both detailed features and global information. Furthermore, considering the characteristics of infrared small targets, an asymmetric Vision Mamba U-Net with multi-level inputs (AVM-Unet) is developed, which captures contextual information effectively while maintaining linear computational complexity. During training, a bilevel optimization strategy is adopted to collaboratively optimize the two subnetworks, thereby yielding optimal parameters for both HDR infrared image enhancement and small target detection. Experimental results demonstrate that the proposed method achieves visually favorable enhancement and high-precision detection, with strong generalization ability and robustness. The performance and efficiency of the method exhibit a well-balanced trade-off.

1. Introduction

Infrared detection, which passively receives infrared radiation emitted from targets or backgrounds, is characterized by a long detection range, weather independence, strong concealment, and high resistance to interference. It has therefore been widely applied in infrared early warning and guidance systems [1]. However, in long-range imaging, targets typically occupy only a few to several dozen pixels on the imaging plane, resulting in extremely low contrast and signal-to-noise ratio, and thus exhibit weak and small target characteristics. Infrared sensing images reveal temperature differences between the target and its local background. To faithfully represent observational details, they are typically captured in a 14–16 bit HDR [2,3,4], which preserves richer details compared with LDR images. Nevertheless, existing publicly available infrared small target detection datasets are all based on LDR images, and current deep learning-based infrared small target detection algorithms are designed specifically for LDR inputs. Conventional HDR image enhancement methods, however, do not account for the subsequent detection task. When HDR images are compressed into the LDR domain, significant details are lost, and discriminative features beneficial for detection are not effectively preserved. As a result, the enhanced images often fail to highlight targets distinctly and are less suitable for human observation. In contrast, detection-driven enhancement approaches generate LDR images in which noise is effectively suppressed, where contrast and target features are substantially enhanced, making the images more favorable for human inspection as intermediate results and simultaneously improving detection accuracy when used as inputs to detection networks.
Current HDR infrared image enhancement algorithms include automatic gain control (AGC) [5], histogram equalization (HE) [6] and its variants [7,8], Gamma correction [9], logarithmic transformation [10], and Retinex-based methods [11]. These traditional algorithms rely on manually designed parameters, have poor scene adaptability, and often suffer from overexposure, noise levels increasing with contrast enhancement, and significant loss of small target features, making the detection of these targets even more challenging.
To better extract features of infrared small targets, researchers have designed various networks. Wang et al. proposed IAA-Net [12], which first performs coarse detection to filter target regions, and then uses a Transformer encoder to model the features of the coarse target region. Li et al. proposed DNA-Net [13], using a densely nested attention network to enhance information retention for small and medium targets in deep layers. Wu et al. introduced UIU-Net [14], which integrates a compact U-Net into a larger U-Net backbone to prevent the loss of small target information. Although these algorithms achieve good detection accuracy, they rely on dense connections and often have large parameter sizes, leading to poor real-time performance. In practical engineering applications, various detection systems require quick responses. Therefore, achieving a good balance between detection performance and efficiency is crucial.
To address the aforementioned challenges, we propose a small target detection framework based on HDR infrared image enhancement. This framework aims to integrate HDR infrared image enhancement with the small target detection task, achieving visually friendly enhancement while ensuring high detection accuracy and real-time performance.
Specifically, we first design a multi-branch feature extraction and fusion subnetwork (MFEF-Net) to achieve the mapping from HDR to LDR. Residual connections and receptive field enlargement have been proven effective in improving denoising capability [15,16,17]. Unlike conventional layer-decomposition approaches [18,19] or dual-branch structures such as DBS-DCN [20], which separate images into base and detail layers for individual processing, our method employs sequential dilated convolutions with varying dilation rates to extract and fuse features at multiple levels. This design enlarges the receptive field while retaining finer details and suppressing noise. At the core of the network, a channel- and spatial-attention module with residual connections (Res_CSAM) further enhances feature extraction and denoising ability. Inspired by the success of VM-UNet [21] in medical image segmentation, we design an asymmetric Vision Mamba U-Net tailored for small target detection. In object detection and segmentation tasks, higher-resolution inputs typically improve accuracy but also increase computational cost, thereby reducing efficiency. A common practice in Vision Transformers (ViT) [22] is to enlarge the patch size to lower the computational overhead and improve efficiency. However, for small targets, overly coarse feature granularity may lead to target loss. To address this, we introduce a cascaded patch embedding strategy with different patch size as inputs to the asymmetric Vision Mamba U-Net. This approach preserves deep-level target features with only marginal computational overhead, significantly boosting detection accuracy. The proposed network is built upon the Visual State Space Model (VSSM), enabling effective contextual information capture while maintaining linear computational complexity, thereby achieving efficient and accurate detection. In addition, we derive a cooperative training strategy to collaboratively learn the optimal parameters of the two subnetworks. Compared with state-of-the-art (SOTA) methods, our approach achieves higher detection accuracy, lower false-alarm rates, and improved detection efficiency. Moreover, the enhanced HDR infrared images generated by our framework exhibit superior visual quality, with noise effectively suppressed and targets distinctly highlighted. The main contributions of this paper are as follows:
(1)
A small target detection framework based on HDR infrared image enhancement (HDR-IRSTD-Net) is designed, achieving detection-oriented HDR infrared image enhancement and detection based on HDR infrared images.
(2)
A cooperative training scheme is proposed, using cooperative optimization to combine HDR infrared image enhancement and small target detection, allowing both the enhancement and detection networks to reach optimal parameters, resulting in better image enhancement visual effects and higher detection accuracy.
(3)
By analyzing the dynamic range of raw images captured by long-wave, mid-wave, and short-wave infrared sensors across various imaging backgrounds and target types, we generated 16-bit images with multiple dynamic ranges based on the NUDT-SIRST and SIRST datasets. This led to the creation of the NUDT-SIRST-16bit and SIRST-16bit datasets, effectively addressing the current lack of HDR infrared small target datasets.
(4)
Comparative experiments with other algorithms on the NUDT-SIRST-16bit and SIRST-16bit datasets show that the proposed framework outperforms other algorithms in metrics such as Pd, Fa, and mIoU, demonstrating strong application potential.
The rest of the paper is organized as follows: Section 2 briefly reviews related research work; Section 3 details the proposed HDR infrared image enhancement-based small target detection framework; Section 4 introduces the implementation details of the experiments, presents visualization results, and performs quantitative analysis; and finally, Section 5 concludes the paper.

2. Related Work

HDR infrared image enhancement and small target detection have long been among the most challenging research areas. Numerous scholars have made significant contributions to these fields. This section briefly reviews the latest research progress and the findings that are of crucial relevance to our work, providing an important foundation for our study.

2.1. Infrared Image Enhancement and Dynamic Range Compression

Currently, HDR infrared image enhancement is primarily dominated by traditional methods. AGC [5] establishes a linear mapping between high and low threshold levels to generate an 8-bit image, but this process often leads to loss of details. HE [6] uses the cumulative distribution function of the original image to produce a target image with uniformly distributed gray levels. However, it tends to amplify noise and degrade image details, which led to the proposal of Piecewise Histogram Equalization (PHE) [7], where the original histogram is thresholded to distribute gray levels more reasonably. These methods aim to establish a one-to-one mapping between input and output pixel intensities from a global perspective. While these algorithms are simple and have good real-time performance, they fail to account for the intensity differences between adjacent pixels and perform poorly in enhancing local contrast and preserving image details. To address these issues, Contrast Limited Adaptive Histogram Equalization (CLAHE) [8] was proposed. It works well in enhancing edges and local contrast without amplifying noise. However, it performs poorly in homogeneous regions. Methods based on Retinex Theory [11] have been widely applied due to their nonlinear operations, which effectively preserve image details and suppress noise. The core of this theory is to use low-pass filtering functions in convolution with the original image to estimate the image’s underexposed, well-exposed, and overexposed regions, thus deriving three illumination components. The Single-Scale Retinex (SSR) algorithm [23], due to its simple parameters, does not strike a good balance between dynamic range compression and contrast enhancement. To address this, Multi-Scale Retinex (MSR) [24] was proposed, which works better but produces halos in regions with large brightness differences and does not significantly improve the details of bright regions. Furthermore, numerous scholars have extensively studied guided filtering and layered theory. By applying guided filtering [25,26,27,28], the base and detail layers of the original image are extracted and then enhanced and recombined. This method yields good detail enhancement but is prone to halo artifacts and gradient reversal. Other dynamic range compression algorithms, such as Gamma Correction [9] and Logarithmic Transformation [10], suffer from poor adaptability and often rely heavily on manual parameter tuning.
In recent years, deep learning-based methods have achieved significant success in low-level vision tasks and have also been applied to HDR infrared image visualization, detail enhancement, and denoising. FFDNet [15] expands the receptive field to improve denoising performance. DBS-DCN [20] is a deep convolutional neural network (CNN) with a dual-branch structure that effectively extracts the structural features of infrared stripe noise while preserving the detail information of the infrared image. RIDNet [16] adopts an enhanced attention module to provide a wide receptive field. Lee et al. [17] proposed a convolutional neural network-based infrared image enhancement algorithm that significantly improves enhancement performance and convergence speed by incorporating residual learning techniques.

2.2. Single-Frame Infrared Small Target Detection

Traditional infrared small target detection algorithms can be categorized into three types: filter-based methods [29,30,31,32], which are computationally simple but only suitable for uniform backgrounds, with limited performance in complex backgrounds; local contrast-based methods [33,34,35,36,37,38], which are easy to implement but susceptible to background interference; and low-rank-based methods [39,40,41,42,43,44], which can handle complex situations but are computationally expensive and have high false-alarm rates in infrared images of dark targets. In summary, traditional methods rely on handcrafted feature design, making them suitable for simple scenarios, but ineffective in complex real-world environments.
Deep learning-based methods have been widely applied in the field of computer vision, especially in recent years, where segmentation-based approaches have made significant progress in small target detection. Dai et al. proposed ALC-Net [45], which integrates local contrast priors into convolutional networks and uses a bottom–up attention mechanism to combine features from different layers. Zhao et al. introduced GGL-Net [46], employing a local contrast learning (LCL) structure. During feature extraction, a gradient supplement module (GSM) that rationalizes the attention mechanism embeds the original gradient information into deeper layers of the network. They also proposed a bidirectional guided fusion module (TGFM) to facilitate multi-scale feature fusion. Zhang et al. presented AGPC-Net [47], which designs a context pyramid module (CPM) to adapt to infrared small targets, as well as an attention-guided context block (AGCB) to estimate the correlations between pixels within and between patches and highlight targets. Wang et al. proposed AFE-Net [1], which introduces an attention mechanism into the encoder and decoder layers, using cascaded non-local operations across different layers to filter out clutter that resembles infrared target features. Wang et al. introduced IAA-Net [12], which uses ResNet18 as the backbone to construct a Region Proposal Network (RPN) that generates coarse target regions and filters out backgrounds. It then utilizes a Transformer-based encoder to model the features of the perceived coarse target regions. Li et al. proposed DNA-Net [13], which integrates multiple U-shaped subnetworks and establishes connections between the encoder and decoder subnetworks to enhance the retention of information for medium and small targets in deeper layers. Wu et al. proposed UIU-Net [14], integrating a compact U-Net into a larger U-Net backbone, preventing the loss of small target information while avoiding classification backbone networks. Chung et al. introduced AMFU-Net [48], which uses full-size skip connections based on UNet3+ [49] to avoid nested structures. RepISD-Net [50] reduces computational costs and achieves lightweight detection through different network architectures. Wang et al. proposed RLPGB-Net [51], which combines reinforcement learning with object detection to highlight target features and introduces a boundary attention module (GB) to enhance infrared small target detection. While these algorithms have made remarkable contributions to improving detection performance or efficiency, they fail to achieve a good balance between performance and efficiency. Moreover, they are all designed for LDR infrared images and cannot process HDR infrared images directly.

2.3. State Space Models and Their Advantages in the Field of Target Segmentation

The algorithms based on convolutional neural networks (CNNs) present significant limitations in sequential modeling, while approaches relying on Transformers face the issue of quadratic computational complexity [52,53]. The State Space Models (SSMs) [54,55,56,57], represented by Mamba [54], have garnered considerable attention from researchers. SSMs maintain linear computational complexity and demonstrate outstanding modeling capabilities, leading to extensive research across numerous fields. Furthermore, MambaOut [52] has further substantiated the advantages of Mamba in long sequence modeling and has shown that detection and segmentation tasks also exhibit characteristics of long sequence modeling, thereby making Mamba suitable for these tasks. VM-UNet [21] established a purely SSM-based model, demonstrating the potential of Mamba in medical image segmentation.

3. Proposed Method

This section provides a detailed introduction to our approach. Firstly, we present an overview of the proposed framework for detecting small targets based on HDR infrared image enhancement. Subsequently, we elaborate on the two crucial constituent subnetworks within this framework. Finally, we abstract the cooperative optimization problem, formulating it as a bilevel optimization problem involving low-level and high-level vision, and outline the cooperative training strategy.

3.1. Overview of the Detection Framework Based on HDR Infrared Image Enhancement

The framework we proposed consists of two subnetworks, MFEF-Net and AVM-Unet. MFEF-Net is designed to achieve the feature mapping from HDR to LDR, while AVM-Unet is responsible for extracting features of small targets and outputting segmentation results. As shown in Figure 1, the framework takes HDR infrared images as input, obtains LDR images through the MFEF-Net network, and directly uses them for human inspection and the target detection task. The reconstructed LDR images are then processed by AVM-Unet, ultimately yielding a segmentation mask. Additionally, the gradients of both detection loss and mapping loss are backpropagated as a whole to simultaneously train and update the parameters of the two networks.

3.2. MFEF-Net

As illustrated in Figure 2a, the MFEF-Net architecture consists of an input layer, a Multi-scale Feature Extraction and Fusion module (MFEF), and an output layer. Both the input and output layers employ a convolutional module composed of a 3 × 3 convolution, batch normalization, and a GELU activation function to extract low-level features from the input image and perform nonlinear correction. The MFEF module adopts a Feature Pyramid Network (FPN) structure design, where each branch consists of a channel and spatial attention module with residual connections (Res_CSAM). Each module contains dilated convolutions with dilation rates set sequentially as 1, 2, 3, and 4. The input features pass through each module to obtain the high-frequency, sub-high-frequency, sub-low-frequency, and low-frequency components of the image, denoted as Y i :
$$Y_i = \begin{cases} \mathrm{Res\_CSAM}_{\mathrm{dilation}=i}(X), & i = 1 \\ \mathrm{Res\_CSAM}_{\mathrm{dilation}=i}(X + Y_{i-1}), & i = 2, 3, 4 \end{cases}$$
The convolution module of the i-th Res_CSAM has a dilation rate of i. The use of a sequential dilated convolutional structure enables the network to capture global information while maximally preserving fine-grained image features, which facilitates the enhancement of small targets and suppresses noise.
Finally, the features from each level are aggregated to generate the new feature representation F t :
$$F_t = \sum_{i=1}^{4} Y_i$$
The Res_CSAM block, shown in Figure 2b, is an improvement to the CBAM. It consists of a channel attention module, a spatial attention module, and a residual structure, which mitigates the potential information loss and training difficulties caused by the attention modules through the use of a residual design. The channel attention module can be expressed as
$$X' = \sigma\big[\mathrm{MLP}(P_{avg}(X)) + \mathrm{MLP}(P_{max}(X))\big] \otimes X$$
where σ is the Sigmoid function and ⊗ is the element-by-element multiplication. The spatial attention module can be represented as
$$X'' = \sigma\big[f_{3 \times 3}([P_{avg}(X'); P_{max}(X')])\big] \otimes X'$$
where σ is the Sigmoid function and f 3 × 3 denotes the 3 × 3 convolution.
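To make the structure above concrete, the following PyTorch sketch implements a simplified Res_CSAM block and the four-branch MFEF fusion; the channel width, MLP reduction ratio, and kernel sizes are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class ResCSAM(nn.Module):
    """Simplified Res_CSAM: dilated conv + channel/spatial attention + residual."""
    def __init__(self, channels: int, dilation: int, reduction: int = 8):
        super().__init__()
        # Dilated convolution module of the i-th branch (dilation rate = i).
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Shared MLP over average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.GELU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # 3x3 convolution over the concatenated avg/max spatial maps.
        self.spatial = nn.Conv2d(2, 1, 3, padding=1)

    def forward(self, x):
        feat = self.conv(x)
        # Channel attention, as in the channel-attention formula above.
        ca = torch.sigmoid(self.mlp(feat.mean((2, 3), keepdim=True))
                           + self.mlp(feat.amax((2, 3), keepdim=True)))
        feat = feat * ca
        # Spatial attention, as in the spatial-attention formula above.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [feat.mean(1, keepdim=True), feat.amax(1, keepdim=True)], dim=1)))
        feat = feat * sa
        return feat + x                        # residual connection

class MFEF(nn.Module):
    """Four sequential dilated branches (rates 1-4) fused by summation."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.branches = nn.ModuleList(
            [ResCSAM(channels, dilation=i) for i in range(1, 5)])

    def forward(self, x):
        outputs, prev = [], None
        for i, branch in enumerate(self.branches):
            inp = x if i == 0 else x + prev    # Y_1 = f(X); Y_i = f(X + Y_{i-1})
            prev = branch(inp)
            outputs.append(prev)
        return sum(outputs)                    # F_t = Y_1 + Y_2 + Y_3 + Y_4
```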

3.3. AVM-UNet

As illustrated in Figure 3a, we cascade three patch embedding branches with different patch sizes as inputs, and design an asymmetric Vision Mamba U-Net as the detection network. This design significantly enhances the performance of detecting small targets while adding only a minimal computational overhead.
Specifically, the image $X \in \mathbb{R}^{H \times W \times 3}$ is divided into non-overlapping patches using patch sizes of $1 \times 1$, $2 \times 2$, and $4 \times 4$, and the embedding dimensions are set to $C/4$, $C/2$, and $C$, respectively. Through experiments, the value of $C$ is set to 64. The patches are then normalized using layer normalization to obtain $X_1 \in \mathbb{R}^{H \times W \times \frac{C}{4}}$, $X_2 \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times \frac{C}{2}}$, and $X_4 \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$. During the encoding phase, each layer uses one Visual State Space (VSS) block for encoding, which halves the height and width of the features while doubling the channel dimension. First, $X_1$ is encoded using a VSS block to produce $X_1' \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times \frac{C}{2}}$, which is then added to $X_2$. This process is repeated by encoding $X_1' + X_2$ to obtain $X_2' \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$, which is added to $X_4$. Subsequently, a fully symmetric structure is employed for encoding and decoding. The remaining encoding process is divided into four stages, each using one VSS block, with the corresponding channel numbers being [C, 2C, 4C, 8C].
The decoding process is also divided into four stages, with each layer using one VSS block for decoding. In the last three stages, a patch expanding operation is applied to double the height and width of the features while halving the channel dimension. Finally, a 4× upsampling operation is used to recover the features to their original height and width, followed by a 1 × 1 convolution to restore the channel dimension. To avoid introducing additional parameters, skip connections are implemented using direct summation.
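To illustrate the cascaded multi-level input described above, the sketch below builds the three patch embeddings (patch sizes 1, 2, and 4 with dimensions C/4, C/2, and C, where C = 64) and the staged additions. The VSS encoding stages are replaced with strided-convolution placeholders and GroupNorm stands in for layer normalization, so this is a structural sketch under stated assumptions rather than the actual AVM-UNet.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Non-overlapping patch embedding followed by a normalization layer."""
    def __init__(self, patch: int, in_ch: int, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.norm = nn.GroupNorm(1, dim)   # stand-in for layer normalization

    def forward(self, x):
        return self.norm(self.proj(x))

def encode_stage(in_dim: int) -> nn.Module:
    """Placeholder for one VSS encoding stage: halves H, W and doubles channels."""
    return nn.Conv2d(in_dim, in_dim * 2, kernel_size=2, stride=2)

class MultiLevelInput(nn.Module):
    """Cascaded patch embeddings (patch sizes 1, 2, 4) feeding the main encoder."""
    def __init__(self, in_ch: int = 3, C: int = 64):
        super().__init__()
        self.embed1 = PatchEmbed(1, in_ch, C // 4)   # X1: H   x W   x C/4
        self.embed2 = PatchEmbed(2, in_ch, C // 2)   # X2: H/2 x W/2 x C/2
        self.embed4 = PatchEmbed(4, in_ch, C)        # X4: H/4 x W/4 x C
        self.stage1 = encode_stage(C // 4)           # -> H/2 x W/2 x C/2
        self.stage2 = encode_stage(C // 2)           # -> H/4 x W/4 x C

    def forward(self, x):
        x1 = self.stage1(self.embed1(x))             # encode X1, add to X2
        x2 = self.stage2(x1 + self.embed2(x))        # encode X1' + X2, add to X4
        return x2 + self.embed4(x)                   # input to the symmetric U-Net

# Example: a 256x256 RGB image yields a 64x64 feature map with 64 channels.
feat = MultiLevelInput()(torch.randn(1, 3, 256, 256))
print(feat.shape)   # torch.Size([1, 64, 64, 64])
```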
The VSS block is the core module of the AVM-UNet, fully following the design of VMamba [53] and VM-UNet [21], as shown in Figure 3b. For the input $Z \in \mathbb{R}^{B \times H \times W \times C}$, after applying layer normalization, the input is divided into two separate branches, $Z_1 \in \mathbb{R}^{B \times H \times W \times \frac{C}{2}}$ and $Z_2 \in \mathbb{R}^{B \times H \times W \times \frac{C}{2}}$. In the first branch, $Z_1$ is processed through a linear layer followed by an activation function, yielding $Z_1' = \mathrm{SiLU}(\mathrm{Linear}(Z_1))$. In the second branch, $Z_2$ is passed through a linear layer, a depthwise separable convolution, and an activation function to obtain $Z_2' = \mathrm{SiLU}(\mathrm{DWConv}(\mathrm{Linear}(Z_2)))$. Subsequently, $Z_2'$ is fed into a 2D Selective Scan (SS2D) module for deeper feature extraction. Then, layer normalization is applied, and the outputs of the two branches are merged through element-wise addition. Finally, these features are mixed through a linear layer, and the final result is combined with a residual connection to form the output of the VSS block.
In the SS2D unit, $Z_2'$ is first unfolded into a sequential representation through the cross-scan operation, constructing sequences $S_k \in \mathbb{R}^{B \times L \times D}, k \in \{1, 2, 3, 4\}$ (B, L, and D respectively represent batch size, token length, and dimension) in four directions: top-left to bottom-right, bottom-left to top-right, and their respective reverse sequences. For each sequence $S_k$, a projection with shared shape but independent parameters is applied to obtain the tensor $(\Delta, B, C)$. Subsequently, a selective scan is performed on each sequence $S_k$ to propagate information across the state space, allowing features to be transmitted throughout the entire image feature map and producing four directional outputs, $Y_k$.
$$Y_k = \mathrm{SelectiveScan}(S_k, \Delta, A, B, C, D), \quad k \in \{1, 2, 3, 4\}$$
where A and D denote learnable parameters.
Afterward, the reverse sequences generated by the selective scan are rearranged back to their forward order, and the vertical sequences are transposed to align within the same spatial layout. The aligned outputs are then summed to obtain the final feature representation $Y = Y_1 + Y_2 + Y_3 + Y_4$. Algorithms 1 and 2 present the pseudocode for the SS2D and Selective Scan operations, respectively.
Algorithm 1 SS2D
Require: $Z_2' \in \mathbb{R}^{B \times H \times W \times \frac{C}{2}}$
1: $S_k \in \mathbb{R}^{B \times L \times D}, k \in \{1, 2, 3, 4\} \leftarrow \mathrm{MakeSequences}(Z_2')$  ▹ top-left → bottom-right, bottom-left → top-right, and their reverses
2: for k = 1 to 4 do
3:   $\Delta, B, C \leftarrow \mathrm{Linear}(S_k), \mathrm{Linear}(S_k), \mathrm{Linear}(S_k)$  ▹ shared shape, independent weights
4:   $Y_k \leftarrow \mathrm{SelectiveScan}(S_k, \Delta, A, B, C, D)$
5:   $Y_k \leftarrow \mathrm{AlignBack}(Y_k)$  ▹ flip-back / transpose to the (H, W) layout
6: end for
7: $Y \leftarrow \sum_{k=1}^{4} Y_k$
8: return $Y$
Algorithm 2 SelectiveScan in SS2D
Require: Sequence $S \in \mathbb{R}^{B \times L \times D}$, step sizes $\Delta$, parameters $A, B, C, D$
1: Initialize state $h_0 \leftarrow 0$
2: $\bar{A} \leftarrow \exp(\Delta A)$
3: $\bar{B} \leftarrow (\Delta A)^{-1}\big(\exp(\Delta A) - I\big) \cdot \Delta B$
4: for $\ell = 1$ to $L$ do
5:   $h_\ell \leftarrow \bar{A} h_{\ell-1} + \bar{B} S_\ell$
6:   $y_\ell \leftarrow C h_\ell + D S_\ell$
7: end for
8: return $Y_k = \{y_\ell\}_{\ell=1}^{L}$
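For readers who prefer code to pseudocode, the following is a literal, non-parallel PyTorch rendering of the SelectiveScan recurrence in Algorithm 2. Shapes follow the text where given; the state dimension N, the simplified Euler discretization of $\bar{B}$ (instead of the exact zero-order-hold expression in step 3), and all dimensions in the smoke test are assumptions for illustration.

```python
import torch

def selective_scan(S, delta, A, B_mat, C_mat, D_vec):
    """S, delta: (B, L, D); A: (D, N); B_mat, C_mat: (B, L, N); D_vec: (D,)."""
    Bsz, L, D = S.shape
    h = S.new_zeros(Bsz, D, A.shape[1])                            # hidden state h_0 = 0
    ys = []
    for l in range(L):
        dA = torch.exp(delta[:, l].unsqueeze(-1) * A)              # discretized A-bar
        dB = delta[:, l].unsqueeze(-1) * B_mat[:, l].unsqueeze(1)  # Euler-simplified B-bar
        h = dA * h + dB * S[:, l].unsqueeze(-1)                    # h_l = A-bar h_{l-1} + B-bar S_l
        y = (h * C_mat[:, l].unsqueeze(1)).sum(-1) + D_vec * S[:, l]  # y_l = C h_l + D S_l
        ys.append(y)
    return torch.stack(ys, dim=1)                                  # (B, L, D)

# Tiny smoke test with made-up dimensions.
B, L, D, N = 2, 16, 8, 4
out = selective_scan(torch.randn(B, L, D), torch.rand(B, L, D) * 0.1,
                     -torch.rand(D, N), torch.randn(B, L, N),
                     torch.randn(B, L, N), torch.randn(D))
print(out.shape)   # torch.Size([2, 16, 8])
```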

3.4. Loss Functions and Cooperative Training Strategy

3.4.1. Loss Function

HDR infrared image enhancement loss function. Since the MSE loss focuses solely on the low-level information of the mapped image through pixel-wise comparison, it does not adequately capture the perceptual quality that aligns with human visual perception. To address this, we adopt a hybrid loss function L m a p , which combines the semantic perceptual loss based on the VGG19 network [58] and the multi-scale structural similarity loss. The formulation is presented below:
$$L_{map} = \lambda_1 L_{Perceptual} + (1 - \lambda_1) L_{MS\text{-}SSIM}$$
where
$$L_{Perceptual} = \frac{1}{N}\sum_{i=1}^{N} \big(F_i(x) - F_i(y)\big)^2$$
where $x$ represents the LDR ground truth image and $y$ denotes the reconstructed image. $F_i(x)$ and $F_i(y)$ correspond to their respective feature representations in the $i$-th layer of the pretrained VGG19 network, with $N$ representing the total number of feature layers. The multi-scale structural similarity loss is defined as
$$L_{MS\text{-}SSIM} = 1 - \prod_{m=1}^{M} \left(\frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}\right)^{\beta_m} \left(\frac{2\sigma_{xy} + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}\right)^{\gamma_m}$$
where M represents the number of different scales of the image, μ x signifies the mean of the LDR ground truth image, σ x 2 denotes the variance of the LDR ground truth image, μ y stands for the mean of the reconstructed image, σ y 2 indicates the variance of the reconstructed image, σ x y is the covariance of the LDR ground truth image and the reconstructed image, β m and γ m symbolize the relative weights of the two components, and c 1 and c 2 are constants introduced to prevent division by zero. The trade-off hyperparameter λ 1 is empirically determined as 0.7 in this study.
Infrared small target detection loss function. Soft-IoU Loss is employed as the loss function for infrared small target detection, defined as
$$L_{SoftIoU} = \frac{\sum_{i,j} P_{i,j} Y_{i,j}}{\sum_{i,j}\big(P_{i,j} + Y_{i,j} - P_{i,j} Y_{i,j}\big)}$$
where Y i , j represents the ground truth mask and P i , j denotes the prediction mask generated by the network.
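A hedged sketch of these loss terms is given below. The VGG19 tap layers, the use of the third-party pytorch_msssim package for the MS-SSIM term, and the "1 − soft IoU" convention for the detection loss are assumptions made for illustration, not the authors' exact implementation; grayscale LDR images would also need to be replicated to three channels before computing VGG features.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19
from pytorch_msssim import ms_ssim   # pip install pytorch-msssim (assumed helper)

class PerceptualLoss(nn.Module):
    """MSE between VGG19 features of the reconstruction and the LDR ground truth."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):   # assumed relu1_2..relu4_4 taps
        super().__init__()
        feats = vgg19(weights="IMAGENET1K_V1").features.eval()
        self.slices, prev = nn.ModuleList(), 0
        for i in layer_ids:
            self.slices.append(nn.Sequential(*feats[prev:i + 1]))
            prev = i + 1
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x, y):
        loss, fx, fy = 0.0, x, y
        for s in self.slices:
            fx, fy = s(fx), s(fy)
            loss = loss + torch.mean((fx - fy) ** 2)
        return loss / len(self.slices)

def mapping_loss(pred_ldr, gt_ldr, perceptual, lam1=0.7):
    """L_map = lam1 * L_perceptual + (1 - lam1) * (1 - MS-SSIM)."""
    l_per = perceptual(pred_ldr, gt_ldr)
    l_msssim = 1.0 - ms_ssim(pred_ldr, gt_ldr, data_range=1.0)
    return lam1 * l_per + (1.0 - lam1) * l_msssim

def soft_iou_loss(pred, target, eps=1e-6):
    """1 - soft IoU between predicted probabilities and the ground-truth mask."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```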

3.4.2. Cooperative Training Strategy

Unlike previous methods for enhancing HDR infrared images, we argue that the LDR image obtained by mapping the HDR infrared image should not only align with human vision but also be well-perceptible by computers, specifically for detection-oriented enhancement. Therefore, we propose embedding the low-level enhancement task into the high-level detection task as a bilevel optimization [59,60,61,62]. Assuming that the HDR image, the ground truth LDR image, and the prediction mask are all grayscale images of size $m \times n$, represented as $X$, $Y$, and $Z \in \mathbb{R}^{m \times n \times 1}$, respectively, the bilevel optimization problem for enhancement and detection can be formulated as follows:
$$\min_{w_d, w_m} L_d\big(\Psi(Y; w_d)\big) \quad \mathrm{s.t.} \quad Y = \Phi(X; w_m)$$
where L d represents the detection loss, Ψ denotes a detection network with learnable parameters w d , and Φ is the image enhancement network with learnable parameters w m . The optimization of the image enhancement network satisfies the following constraint:
$$\min_{w_m} L_{map}\big(\Phi(X; w_m)\big)$$
where L m a p is the mapping loss of enhancement. Thus, a cooperative training strategy can be derived from the bilevel optimization relationship, which transforms Equation (10) into
$$\min_{w_d, w_m} L_d\big(\Psi(Y; w_d)\big) + \lambda_2 L_{map}\big(\Phi(X; w_m)\big) \quad \mathrm{s.t.} \quad Y = \Phi(X; w_m)$$
where λ 2 is the trade-off parameter, set as 0.5.
In this way, the detection optimization constrained by enhancement is converted into mutual optimization, so as to obtain the optimal network parameters ( w d , w m ). The gradient propagation process of the cooperative training optimization for the image enhancement network and the detection network can be expressed as
$$\frac{\partial L_d}{\partial w_d} = \frac{\partial L_d}{\partial \Psi}\frac{\partial \Psi}{\partial w_d}, \qquad \frac{\partial L}{\partial w_m} = \frac{\partial L_d}{\partial \Psi}\frac{\partial \Psi}{\partial \Phi}\frac{\partial \Phi}{\partial w_m} + \lambda_2 \frac{\partial L_{map}}{\partial \Phi}\frac{\partial \Phi}{\partial w_m}$$
The above equation indicates that the gradient of the detection loss is backpropagated together with the gradient of the mapping loss of the enhancement network. This cooperative training strategy yields enhancement results that not only conform to human vision but are also optimal for detection, and it converges more efficiently than training and optimizing the two subnetworks separately.
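A minimal sketch of one cooperative training step is shown below, assuming an optimizer constructed over the parameters of both subnetworks; the network and loss-function arguments are placeholders for MFEF-Net, AVM-UNet, and the losses defined in Section 3.4.1.

```python
def cooperative_step(mfef_net, avm_unet, optimizer, hdr, ldr_gt, mask_gt,
                     detection_loss_fn, mapping_loss_fn, lam2=0.5):
    """One joint update: both losses are backpropagated through both subnetworks."""
    optimizer.zero_grad()
    ldr_pred = mfef_net(hdr)               # Y = Phi(X; w_m), enhanced LDR image
    mask_pred = avm_unet(ldr_pred)         # Psi(Y; w_d); gradients also reach w_m
    loss = detection_loss_fn(mask_pred, mask_gt) \
           + lam2 * mapping_loss_fn(ldr_pred, ldr_gt)
    loss.backward()                        # joint gradient of the combined objective
    optimizer.step()
    return float(loss.detach())
```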

4. Experiments and Analysis

This section introduces the evaluation metrics and implementation details. We compare the proposed HDR-IRSTD-Net framework with several SOTA methods. Finally, an ablation study is conducted to investigate the effectiveness of our approach.

4.1. Experimental Setup

4.1.1. Dataset

We collected a large number of raw infrared images from infrared sensors across different wavelength ranges (long-wave, mid-wave, short-wave, etc.). After analyzing the dynamic range of raw infrared images with various imaging backgrounds and target types, we generated multiple HDR images with different dynamic ranges based on the NUDT-SIRST [13] and SIRST [63] datasets. This was performed using the methods from references [64,65], and linear stretching (LS) with a ratio of 7:2:1. The generated images were used to create the HDR infrared small target datasets, NUDT-SIRST-16bit and SIRST-16bit, for training tasks such as HDR infrared image enhancement and small target detection. When generating HDR images using the method from reference [64], the steps and parameter settings from that reference were applied. In the method described in reference [65], 7916 pairs of LDR/HDR images were randomly selected from the collected images, and training was performed using the parameters from that reference.
Figure 4 illustrates the comparison between the HDR images reconstructed from LDR inputs of the FLIR dataset and their corresponding ground-truth HDR images. The results indicate that the method in reference [64] restores certain local details and amplifies or introduces noise. The method in reference [65] achieves better recovery of local details. LS restores a limited amount of fine details. Overall, the synthetic HDR images are able to recover part of the real HDR details and capture the general dynamic range of the scenes. Figure 5 shows the HDR images generated by the three methods and their corresponding pixel distributions. From the pixel distributions, it can be observed that some image details have been restored. Figure 6 displays the grayscale distribution of the real raw infrared images and the generated NUDT-SIRST-16bit images. The dynamic range of our generated HDR infrared images is more complex and variable, which effectively tests the performance of the algorithm.

4.1.2. Evaluation Metrics

Evaluation Metrics of Image Enhancement. For image reconstruction tasks such as HDR infrared image enhancement, when ground truth is available, the algorithm performance is typically evaluated using Peak Signal-to-Noise Ratio (PSNR) [66] and Structural Similarity Index (SSIM) [67].
PSNR measures the similarity between the reconstructed image and the ground truth. A PSNR value greater than 30 decibels (dB) is considered to indicate high-quality reconstruction, and it is defined as
$$PSNR = 10 \log_{10}\left(\frac{(2^n - 1)^2}{MSE}\right)$$
where n is the number of bits per sample, and MSE is the mean squared error between the reconstructed LDR image and the ground truth LDR image.
SSIM quantifies the similarity between the reconstructed image and the ground truth by assessing the structural information. For two images x and y, SSIM is defined as
$$SSIM = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $\mu_x$ and $\mu_y$ represent the means of $x$ and $y$, respectively, $\sigma_{xy}$ denotes the covariance between $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ denote the variances of $x$ and $y$, respectively, and $c_1$ and $c_2$ are constants introduced to prevent division by zero.
Evaluation Metrics of Infrared Small Target Detection. For infrared small target detection tasks, in order to objectively analyze the detection performance of various algorithms, we use mIoU, Pd, Fa, ROC, FLOPs and FPS as quantitative evaluation metrics.
The Intersection over Union (IoU) is calculated as the intersection of the ground truth and predicted target pixels divided by their union. Mathematically, it can be expressed as
$$\mathrm{IoU} = \frac{A_{inter}}{A_{all}}$$
where $A_{inter}$ and $A_{all}$ denote the intersection region and the union region, respectively.
The mIoU is defined as
$$\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k} \mathrm{IoU}_i$$
where k represents the number of category labels, and k + 1 represents the total number of categories.
Probability of Detection (Pd) [68] represents the proportion of correctly detected target pixels among all ground-truth target pixels, defined as
$$P_d = \frac{TP}{TP + FN}$$
where T P denotes the number of correctly detected pixels, and F N denotes the number of pixels incorrectly classified as background.
The false-alarm rate (Fa) measures the proportion of incorrectly predicted pixels among all image pixels, and is defined as
$$F_a = \frac{P_{false}}{P_{all}}$$
where P f a l s e and P a l l represent the number of false predicted pixels and the total number of pixels in the image, respectively.
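For reference, a minimal NumPy sketch of the pixel-level metrics defined above (IoU, Pd, and Fa) is given below; thresholding of the network output and any target-level matching are deliberately simplified away.

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: boolean masks of identical shape (H, W)."""
    tp = np.logical_and(pred, gt).sum()      # correctly detected target pixels
    fn = np.logical_and(~pred, gt).sum()     # target pixels classified as background
    fp = np.logical_and(pred, ~gt).sum()     # falsely predicted target pixels
    iou = tp / max(tp + fn + fp, 1)          # intersection over union
    pd = tp / max(tp + fn, 1)                # probability of detection
    fa = fp / pred.size                      # false-alarm rate
    return iou, pd, fa

# Example on a toy 8x8 mask pair.
gt = np.zeros((8, 8), dtype=bool); gt[2:4, 2:4] = True
pred = np.zeros((8, 8), dtype=bool); pred[2:4, 3:5] = True
print(pixel_metrics(pred, gt))   # (0.333..., 0.5, 0.03125)
```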
The Receiver Operating Characteristic (ROC) curve plots Fa on the horizontal axis and Pd on the vertical axis. The Area Under the Curve (AUC) represents a comprehensive metric derived from the ROC curve, with larger values indicating superior method performance.
FLOPs quantify the floating-point operations executed within a network, serving as a metric for network complexity. Network parameters denote the overall number of trainable parameters in the network model, indicating the model’s size. In detection networks, FLOPs and parameter counts indicate the hardware computation and memory demands that must be met while still ensuring detection performance.
The Relationship Between Enhancement and Detection Metrics. The core objective of image enhancement is to improve the visual quality of images, whereas the performance of object detection models heavily relies on the recognizability of target features in the input images. The SSIM measures the structural consistency between the enhanced image and the reference image. In object detection, IoU depends on the clarity of the target structures. Images with high SSIM allow the model to more accurately localize the target, thereby improving the IoU.
PSNR reflects the noise level in an image and quantifies the denoising effect of enhancement methods. Noise can interfere with the extraction of target features, leading to false positives or missed detections by the model. An increase in PSNR after enhancement often correlates with improved Precision and F1-score in target detection.
However, the relationship between enhancement metrics and detection metrics is not strictly linear. Some enhancement methods may improve PSNR/SSIM but blur the target details, resulting in a decrease in detection IoU and Pd. Conversely, certain enhancement methods may lower PSNR yet improve detection Pd by highlighting critical target features. Ultimately, the metrics of downstream tasks, such as object detection, serve as the ultimate standard for evaluating the practicality of image enhancement methods.

4.1.3. Implementation Details

During training, the dataset is split into a 7:3 ratio for training and testing. To prevent overfitting, data augmentation techniques such as random flipping and random rotation are applied. The hyperparameters for direct training optimization (DT) and collaborative training optimization (CT) of MFEF-Net and AVM-UNet are set as follows: AdamW [69] optimizer, an initial learning rate of $5.0 \times 10^{-4}$, and a CosineAnnealingLR scheduler [70] with a maximum iteration count of 50 and a minimum learning rate of $1.0 \times 10^{-5}$. The batch size is set to 8, and the number of epochs is 1000. Our method is implemented on a Windows 10 platform (Intel I9-13900KF @ 3 GHz, NVIDIA GeForce RTX 4090 GPU) using PyTorch [71].
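The optimizer and scheduler configuration stated above maps directly onto the PyTorch API; the snippet below reproduces those settings with stand-in modules so it can be run as-is (the real training loop, data loading, augmentation, and the cooperative step are omitted).

```python
import torch
import torch.nn as nn

# Stand-in modules in place of MFEF-Net and AVM-UNet.
mfef_net = nn.Conv2d(1, 1, 3, padding=1)
avm_unet = nn.Conv2d(1, 1, 3, padding=1)

params = list(mfef_net.parameters()) + list(avm_unet.parameters())
optimizer = torch.optim.AdamW(params, lr=5.0e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1.0e-5)

for epoch in range(5):            # 1000 epochs with batch size 8 in the paper
    # ... one pass over the training set (e.g., cooperative_step) would go here ...
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```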

4.2. HDR Infrared Image Enhancement Results

We compare the HDR infrared image enhancement performance of the MFEF-Net network with five other methods (HE, CLAHE, Retinex, Gamma, and Log) on the NUDT-SIRST-16bit validation set, based on visual effects and quantitative metrics.
As shown in Figure 7, by comparing the enhanced visual results of different methods on the test images, HE enhancement increases the overall contrast of the image, making the pixel distribution more uniform, but it suffers from issues such as detail loss and over-enhancement. Specifically, HE amplifies noise while increasing contrast, which leads to image structural distortion. After CLAHE enhancement, the image appears overall darker, with pixel values concentrated in a smaller range of gray levels, and the contrast is low. The Retinex-enhanced image maintains a good balance of contrast and detail but exhibits over-enhancement and halo effects. The Gamma and Log enhanced images generally show low contrast. Our algorithm, whether trained alone or in collaboration with the detection algorithm (AVM-UNet), produces enhanced images whose histograms more closely resemble the LDR ground truth histogram. It effectively suppresses noise, preserves image details better, significantly improves contrast, and makes it easier to distinguish targets from the background. Additionally, the enhanced images align more with human visual habits, demonstrating comprehensive superiority over traditional methods.
To comprehensively evaluate the performance of HDR infrared image enhancement, both visual assessment and quantitative metrics are required. As seen in Table 1, HE achieves average overall performance in terms of PSNR, but its SSIM is relatively low and exhibits large fluctuations, indicating insufficient preservation of structural details and a tendency toward over-enhancement or noise amplification. CLAHE achieves PSNR values comparable to HE but with even lower SSIM, suggesting more severe structural distortion and overall weaker performance than HE. Gamma and Log methods show large variations in both PSNR and SSIM across different images, indicating high sensitivity to parameter settings, poor robustness, and enhancement results that depend heavily on specific image content and parameter configurations. The Retinex method outperforms other traditional methods across all metrics, but some images still show relatively low SSIM, reflecting its limited capability in restoring structural details in complex scenes. When trained independently, MFEF-Net (DT) achieves slightly lower PSNR compared to others, though consistently above 30dB, while its SSIM remains very high. In contrast, MFEF-Net (CT), when cooperatively trained with the detection algorithm, achieves excellent performance in both PSNR and SSIM across all images. Combined with the visual results, this demonstrates that the proposed method provides superior performance in contrast enhancement, noise suppression, and detail preservation, with greater robustness overall.

4.3. Analysis of Infrared Small Target Detection Results

4.3.1. AVM-UNet Detection Performance Analysis

To showcase the efficacy of our approach, we extensively compared it with SOTA deep learning methods (IAA-Net, RepISD-Net, DNA-Net, UIU-Net, and AMFU-Net) on the NUDT-SIRST [13] and SIRST [63] datasets. Previous research has consistently highlighted the substantial performance disparity between traditional algorithms and CNN-based algorithms in handling intricate scenarios within the NUDT-SIRST and SIRST datasets, primarily attributed to the former’s dependence on manual feature extraction. Further analysis of this comparison is therefore deemed unnecessary. For objective comparisons, we utilized the open-source code provided in the original papers for all algorithms, maintaining default parameter settings, and all models were retrained with the same 7:3 training/validation split.
Table 2 presents the performance of our method compared to other SOTA methods in terms of mIoU, Pd, Fa, FLOPs, Params, and FPS. RepISD-Net and AMFU-Net, due to their lightweight design, have fewer parameters and fast detection speeds. However, they sacrifice some accuracy and have higher false-alarm rates. These methods are suitable for lightweight real-time scenarios but may not be sufficient for accurate detection tasks in complex backgrounds. IAA-Net performs well on the NUDT-SIRST dataset; however, its computational overhead and parameter count are significantly higher than other algorithms, resulting in the slowest inference speed and poor real-time performance. UIU-Net exhibits outstanding performance on both datasets but suffers from low inference speed due to its large parameter count and high computational cost, limiting its applicability. DNA-Net (Res18) shows balanced performance across all metrics but lacks significant advantages, with segmentation performance and computational efficiency both at a moderate level. Our algorithm outperforms other methods in terms of mIoU, Pd, and Fa, demonstrating superior overall performance. This is due to the use of SSM as the backbone, which allows us to maintain high efficiency while achieving very high accuracy.
The ROC performance of our method compared to other SOTA methods is shown in Figure 8. UIU-Net and IAA-Net, due to their nested structures and large parameter counts, exhibit relatively stable performance. They demonstrate a certain detection capability at moderate false alarm rates, but at lower false alarm rates, their detection probabilities are significantly lower than the proposed algorithm. RepISD-Net performs slightly worse than the first two methods, with a faster decay in detection probability as the false alarm rate decreases. AMFU-Net and DNA-Net(Res18) show much lower detection probabilities, even at higher false alarm rates, making it difficult to maintain high detection rates while suppressing false alarms. Our method, however, maintains a high detection rate even at very low false alarm rates, demonstrating more stable performance compared to the other methods.
Figure 9 and Figure 10 present the visual detection results of different methods on the NUDT-SIRST and SIRST datasets, respectively. Although IAA-Net can capture the target regions, it suffers from false detections, particularly in low-gray background areas, leading to a higher number of false-alarms. RepISD-Net exhibits missed detections in small-size targets or low-contrast scenes and shows incomplete segmentation of target regions. DNA-Net does not show significant missed detections or severe misdetections on NUDT-SIRST, but it lacks sufficient edge refinement, causing distortions in transition areas along the edges. Furthermore, it has a relatively high false-alarm rate on the SIRST dataset. UIU-Net can accurately detect targets, but it experiences some false alarms and missed detections in extremely low-contrast images. AMFU-Net performs well in background suppression but has difficulty detecting small targets, with imprecise segmentation. The visual detection results further demonstrate that our algorithm achieves a much lower false alarm rate and missed detection rate, with more precise segmentation shapes. In scenes with high-background brightness, other algorithms exhibit some false alarms or missed detections, whereas our method maintains a higher level of accuracy. This is attributed to the use of an asymmetric multi-level input structure, which preserves more target detail features, while the SSM-based backbone network allows for better global context understanding.

4.3.2. Impact of HDR Infrared Image Enhancement on Detection Tasks

To further investigate the impact of HDR infrared image enhancement on detection tasks, we compared the detection results on AVM-UNet using enhanced images from different methods discussed in Section 4.2.
The results in Table 3 show that the proposed MFEF-Net, whether optimized independently or in collaboration with the detection network, achieves superior detection results in terms of mIoU, Pd, and Fa compared to the original LDR 8-bit images. Figure 11 further illustrates that when MFEF-Net is trained in collaboration with AVM-UNet, the enhanced results are more favorable for detection. When MFEF-Net is trained independently, the enhancement results are still among the best, while other methods’ enhanced images lead to poor detection performance, with extremely high false-alarm rates and almost no effective detection of targets.

4.4. Ablation Study

We conducted ablation studies on the NUDT-SIRST dataset to investigate the model architecture and parameter settings of our method, and to further validate the effectiveness of each module. In addition, we examined the contribution of the collaborative optimization strategy.

4.4.1. MFEF-Net Network Architecture Study

Ablation studies were conducted on MFEF-Net to evaluate the effectiveness of the multi-branch structure and the Res_CSAM module. To verify the contribution of the multi-branch structure, we configured a single branch with dilation = 1 (one branch), two branches with dilation = 1, 2 (two branch), and three branches with dilation = 1, 2, 3 (three branch), and compared them against the full MFEF-Net. To assess the effectiveness of Res_CSAM, we constructed a network that uses the same branches and dilation settings as MFEF-Net but replaces each branch with standard convolutional modules (w/o Res_CSAM), and compared it with the complete MFEF-Net. Figure 12 presents the evaluation results on the validation set using PSNR and SSIM as metrics, illustrating the performance of networks with different parameter configurations during training. Table 4 reports the IoU, Pd, and Fa values obtained by feeding the enhanced images produced by each configuration into AVM-UNet for detection.
As shown in Figure 12, the network with a multi-branch structure demonstrates significantly higher performance in terms of PSNR and SSIM compared to other architectures. Networks incorporating the Res_CSAM module show notably better enhancement results than those using standard convolutional modules, indicating that both the multi-branch structure and Res_CSAM play crucial roles in feature re-mapping and noise suppression. As shown in Table 4, even when using standard convolutional modules instead of Res_CSAM, the network with a multi-branch structure outperforms the two branch network enhanced with the Res_CSAM module in terms of detection results, including mIoU, Pd, and Fa. The detection performance of the images enhanced by our method shows significant improvement in mIoU, Pd, and Fa compared to other network structures. Therefore, the multi-branch structure combined with the Res_CSAM module offers advantages in noise suppression and target feature extraction, making the enhanced images more suitable for detection tasks.

4.4.2. AVM-UNet Network Architecture Study

The ablation study on AVM-UNet primarily evaluates the contribution of multiple inputs. We configured variants with a single input, combinations of two different inputs, and the full AVM-UNet for comparison. Table 5 and Figure 13 present the detection performance of these network configurations on the validation set.
As shown in Table 5, the fully symmetric VM-UNet architecture demonstrates no improvement in FPS and a significant decline in detection accuracy when using smaller patch-size or increasing both patch-size and feature map dimensions. Any combination of two inputs outperforms single-input configurations, and all combinations containing Input 1 significantly outperform those without it. This indicates that Input 1 contributes the most independently and is the key information source. Figure 13 shows that our asymmetric design maintains a detection probability close to 1 even at extremely low false-alarm rates, achieving optimal detection performance. Smaller patch sizes help preserve fine-grained features but come with a high computational cost. By cascading patches of different sizes as multi-level inputs, this asymmetric design can expand the receptive field while retaining fine-grained details, thus facilitating precise segmentation shapes. It strikes an excellent balance between detection accuracy and computational efficiency.

4.4.3. Effectiveness of Collaborative Optimization

While our method demonstrates good performance when trained independently, the experimental results in Section 4.2 and Section 4.3 show that collaborative optimization of low-level and high-level vision tasks significantly enhances the performance of both tasks. As shown in Table 1, the collaborative training MFEF-Net(CT) performs best in terms of PSNR and SSIM across multiple image groups. Table 3 shows that the detection results of the enhanced images from MFEF-Net(CT) outperform those from MFEF-Net(DT) in terms of mIoU, Pd, and Fa. Therefore, the enhanced images from collaborative optimization are more favorable for detection. Additionally, the results in Table 2 demonstrate that the detection network AVM-UNet(CT), when trained with collaborative optimization, improves by 2.24% in mIoU and 1.87% in Pd compared to AVM-UNet(DT). Collaborative optimization also helps improve the accuracy of the detection network.
To intuitively analyze and compare the effectiveness of the collaborative training strategy in both enhancement and detection, we visualized the intermediate feature maps and inference results of MFEF-Net (DT), MFEF-Net (CT), AVM-UNet (DT), and AVM-UNet (CT) under identical input images. As shown in Figure 14, from top to bottom are the inference results and the feature maps of Layer 1 and Layer 2 for MFEF-Net (DT) and MFEF-Net (CT), followed by the inference results and the feature maps of encoding Stage-2 and decoding Stage-2 for AVM-UNet (DT) and AVM-UNet (CT). For the comparison, six consecutive channels of corresponding feature maps were selected from both MFEF-Net (DT) and MFEF-Net (CT), as well as from AVM-UNet (DT) and AVM-UNet (CT).
The comparison reveals that MFEF-Net (DT), in both its first and second MFEF modules, captures effective texture features but exhibits insufficient edge discrimination. Moreover, there is a significant variation between the two layers, indicating a progressive loss of fine details. In contrast, the collaboratively trained MFEF-Net (CT) demonstrates stronger structural discrimination and inter-layer consistency, with sharper texture boundaries and a more natural transition between layers. This model effectively preserves both local and global structural information, resulting in better detail retention. Similarly, AVM-UNet (CT) shows improved focus on key structural regions during the encoding stage and exhibits clearer hierarchical structure and boundaries after decoding, indicating better feature continuity. Overall, MFEF-Net (CT) achieves a more effective balance between texture enhancement and noise suppression, producing more stable and visually clearer enhancement results, while AVM-UNet (CT) attains more precise feature localization and superior edge segmentation performance under the collaborative training framework.

4.5. Edge Device Verification and Compatibility Testing

To evaluate the feasibility of deploying the proposed model on edge devices, we conducted compatibility and performance tests on the NVIDIA Jetson Orin Nano (8 GB) platform. During testing, the model was converted to the ONNX format and optimized using TensorRT. It was then evaluated on the NUDT-SIRST test set over 50,000 iterations to obtain metrics including average inference latency, average power consumption, FPS, and peak memory usage. The experimental results are summarized in Table 6. The model achieved an average inference latency of 7.9 ms, a peak memory usage of 398 MB, an FPS of 126.58, and an average power consumption of 8 W, demonstrating good compatibility and real-time performance on edge embedded systems. Compared with other lightweight models, its performance is comparable.
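As a rough illustration of the deployment path described above (not the authors' exact toolchain settings), the snippet below exports a stand-in model to ONNX with torch.onnx.export and notes the kind of trtexec invocation used to build a TensorRT engine on the Jetson; the file names, input resolution, and opset version are hypothetical.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1).eval()   # stand-in for the trained AVM-UNet
dummy = torch.randn(1, 3, 256, 256)            # assumed input resolution
torch.onnx.export(model, dummy, "avm_unet.onnx",
                  input_names=["image"], output_names=["mask"], opset_version=17)

# On the Jetson Orin Nano, a TensorRT engine can then be built and benchmarked, e.g.:
#   trtexec --onnx=avm_unet.onnx --saveEngine=avm_unet.plan --fp16
```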

5. Conclusions

This paper proposes an HDR infrared image enhancement framework for small target detection, called HDR-IRSTD-Net. The framework consists of MFEF-Net and AVM-UNet. MFEF-Net adopts a multi-branch feature extraction and fusion structure, which effectively suppresses noise, enhances contrast, and strengthens targets during the mapping process from HDR infrared images to LDR images. AVM-UNet uses SSM as the backbone network for infrared small target detection, with a multi-level input structure that reduces computational complexity while preserving more detailed target features. Compared to the SOTA CNN-based SIRST detection methods, our method exhibits superior performance. Furthermore, by adopting collaborative optimization, our framework achieves HDR image enhancement and detection that are both visually friendly and precise, maintaining a good balance between accuracy and efficiency. When inferred independently, AVM-UNet can be deployed on platforms with limited computational resources to achieve high-precision real-time detection. When MFEF-Net and AVM-UNet are used together for inference, they can be applied to ground-based infrared detection systems, enabling high-precision real-time detection.
Although our method demonstrates superior performance in HDR infrared image enhancement and infrared small target detection, there remains significant room for improvement. First, the HDR datasets used in this study were generated using various synthetic methods. Consequently, the extent to which the synthesized HDR images capture sensor noise and environmental variations present in real HDR infrared images cannot be accurately assessed. In future work, we aim to collect authentic HDR images whenever possible or enhance the realism of synthesized HDR images by incorporating simulated sensor noise as well as environmental factors such as temperature fluctuations and variable illumination. Second, the current HDR image enhancement algorithms require paired HDR/LDR images for training, whereas LDR images often need to be obtained using traditional methods. To address this limitation, we plan to investigate weakly supervised or unsupervised HDR infrared image enhancement approaches. Third, VMamba currently employs a cross-scan strategy that leverages inherent sequential properties, which constrains parallel computation and reduces efficiency. In future research, we will explore more efficient scanning strategies to further improve model performance.

Author Contributions

P.C. and F.G. proposed the original idea; F.G. and W.W. performed the experiments; F.G. wrote the manuscript; W.W. and P.C. reviewed and edited the manuscript; W.Z. contributed to the direction and content and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

NUDT-SIRST data is available at https://pan.baidu.com/s/1WdA_yOHDnIiyj4C9SbW_Kg?pwd=nudt (accessed on 20 July 2024). SIRST data is available at https://github.com/YimianDai/sirst (accessed on 20 July 2024).

Acknowledgments

The authors would like to thank Li for providing the freely available NUDT-SIRST data and Dai for providing the freely available SIRST data.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GT: Ground Truth
HDR: High Dynamic Range
LDR: Low Dynamic Range
AGC: Automatic Gain Control
HE: Histogram Equalization
CLAHE: Contrast Limited Adaptive Histogram Equalization
PSNR: Peak Signal-to-Noise Ratio
SSIM: Structural Similarity Index

References

  1. Wang, K.; Wu, X.; Zhou, P.; Chen, Z.; Zhang, R.; Yang, L.; Li, Y. AFE-Net: Attention-Guided Feature Enhancement Network for Infrared Small Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4208–4221. [Google Scholar] [CrossRef]
  2. Barnard, K.J. Dynamic Range Compression for Visual Display of Thermal Infrared Imagery. In Imaging Systems and Applications; Optica Publishing Group: Washington, DC, USA, 2020. [Google Scholar]
  3. Zhao, Y.H.; Wang, Y.Y.; Luo, H.B.; Li, F.Z. New technique for dynamic-range compression and contrast enhancement in infrared imaging systems. Hongwai yu Jiguang Gongcheng/Infrared Laser Eng. 2018, 47, 172–181. [Google Scholar]
  4. Zeng, Y.; Zhang, Z.; Zhou, X.; Liu, Y. High dynamic range infrared image compression and denoising. In Proceedings of the 2019 International Conference on Information Technology and Computer Application (ITCA), Guangzhou, China, 20–22 December 2019. [Google Scholar]
  5. Garcia, F. Real-time visualization of low contrast targets from high-dynamic range infrared images based on temporal digital detail enhancement filter. J. Electron. Imaging 2015, 24, 061103. [Google Scholar] [CrossRef]
  6. Peli, E. Contrast in complex images. J. Opt. Soc. Am. A Opt. Image Sci. 1990, 7, 2032–2040. [Google Scholar] [CrossRef] [PubMed]
  7. Vickers, V.E. Plateau equalization algorithm for real-time display of high-quality infrared imagery. Opt. Eng. 1996, 35, 1921–1926. [Google Scholar] [CrossRef]
  8. Reza, A.M. Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2004, 38, 35–44. [Google Scholar] [CrossRef]
  9. Farid, H. Blind inverse gamma correction. IEEE Trans. Image Process. 2001, 10, 1428–1433. [Google Scholar] [CrossRef]
  10. Drago, F.; Myszkowski, K.; Annen, T.; Chiba, N. Adaptive Logarithmic Mapping For Displaying High Contrast Scenes. Comput. Graph. Forum 2003, 22, 419–426. [Google Scholar] [CrossRef]
  11. Land, E.H.; McCann, J.J. Lightness and retinex theory. J. Opt. Soc. Am. 1971, 61, 1–11. [Google Scholar] [CrossRef]
  12. Wang, K.W.; Du, S.Y.; Liu, C.X.; Cao, Z.G. Interior Attention-Aware Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  13. Li, B.Y.; Xiao, C.; Wang, L.G.; Wang, Y.Q.; Lin, Z.P.; Li, M.; An, W.; Guo, Y.L. Dense Nested Attention Network for Infrared Small Target Detection. IEEE Trans. Image Process. 2023, 32, 1745–1758. [Google Scholar] [CrossRef] [PubMed]
  14. Wu, X.; Hong, D.F.; Chanussot, J. UIU-Net: U-Net in U-Net for Infrared Small Object Detection. IEEE Trans. Image Process. 2023, 32, 364–376. [Google Scholar] [CrossRef] [PubMed]
  15. Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a Fast and Flexible Solution for CNN-Based Image Denoising. IEEE Trans. Image Process. 2018, 27, 4608–4622. [Google Scholar] [CrossRef] [PubMed]
  16. Anwar, S.; Barnes, N. Real Image Denoising with Feature Attention. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3155–3164. [Google Scholar]
  17. Lee, K.; Lee, J.; Lee, J.; Hwang, S.; Lee, S. Brightness-based convolutional neural network for thermal image enhancement. IEEE Access 2017, 5, 2169–3536. [Google Scholar] [CrossRef]
  18. Zuo, C.; Chen, Q.; Liu, N. Display and detail enhancement for high-dynamic-range infrared images. Opt. Eng. 2011, 50, 127401. [Google Scholar] [CrossRef]
  19. Liu, N.; Zhao, D. Detail enhancement for high-dynamic-range infrared image based on guided image filter. Infrared Phys. Technol. 2014, 67, 138–147. [Google Scholar] [CrossRef]
  20. Lee, J.; Ro, Y.M. Dual-Branch Structured De-Striping Convolution Network Using Parametric Noise Model. IEEE Access 2020, 8, 155519–155528. [Google Scholar] [CrossRef]
  21. Ruan, J.; Li, J.; Xiang, S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2402.02491. Available online: https://arxiv.org/abs/2402.02491 (accessed on 13 April 2024). [CrossRef]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  23. Jobson, D.J.; Rahman, Z.; Woodell, G.A. Properties and performance of a center/surround Retinex. IEEE Trans. Image Process. 1997, 6, 451–462. [Google Scholar] [CrossRef]
  24. Jobson, D.J.; Rahman, Z.; Woodell, G.A. A multi-scale Retinex for bridging the gap between color images and the human observation of scenes. Trans. Image Process. 1997, 6, 965–976. [Google Scholar] [CrossRef]
  25. Zhang, H.; Chen, Z.Q.; Cao, J.Z.; Li, C. Infrared Image Enhancement Based on Adaptive Guided Filter and Global–Local Mapping. Photonics 2024, 11, 717. [Google Scholar] [CrossRef]
  26. Tan, A.L.; Liao, X. Infrared Image Enhancement Algorithm Based on Detail Enhancement Guided Image Filtering. Vis. Comput. 2023, 39, 6491–6502. [Google Scholar] [CrossRef]
  27. Zhang, F.F.; Dai, Y.M.; Chen, Y.H. Display Method for High Dynamic Range Infrared Image Based on Gradient Domain Guided Image Filter. Opt. Eng. 2024, 63, 013105. [Google Scholar] [CrossRef]
  28. Guo, Z.Y.; Yu, X. Infrared and Visible Image Fusion Based on Saliency and Fast Guided Filtering. Infrared Phys. Technol. 2022, 123, 104178. [Google Scholar] [CrossRef]
  29. Deshpande, S.D.; Er, M.H.; Ronda, V.; Chan, P. Max-Mean and Max-Median filters for detection of small-targets. In Proceedings of the Conference on Signal and Data Processing of Small Targets 1999, Denver, CO, USA, 20–22 July 1999; pp. 74–83. [Google Scholar]
  30. Tom, V.T.; Peli, T.; Leung, M.; Bondaryk, J.E. Morphology-based algorithm for point target detection in infrared backgrounds. In Proceedings of the 5th Conf on Signal and Data Processing of Small Targets, Orlando, FL, USA, 12–14 April 1993; pp. 2–11. [Google Scholar]
  31. Rivest, J.F.; Fortin, R. Detection of dim targets in digital infrared imagery by morphological image processing. Opt. Eng. 1996, 35, 1886–1893. [Google Scholar] [CrossRef]
  32. Barnett, J. Statistical analysis of median subtraction filtering with application to point target detection in infrared backgrounds. In Proceedings of the Infrared Systems and Components III, Los Angeles, CA, USA, 16–17 January 1989; Volume 1050, pp. 10–18. [Google Scholar]
  33. Deng, H.; Sun, X.; Liu, M.; Ye, C.; Zhou, X. Small Infrared Target Detection Based on Weighted Local Difference Measure. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4204–4214. [Google Scholar] [CrossRef]
  34. Han, J.H.; Liu, S.B.; Qin, G.; Zhao, Q.; Zhang, H.H.; Li, N.N. A Local Contrast Method Combined With Adaptive Background Estimation for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1442–1446. [Google Scholar] [CrossRef]
  35. Han, J.H.; Moradi, S.; Faramarzi, I.; Zhang, H.H.; Zhao, Q.; Zhang, X.J.; Li, N. Infrared Small Target Detection Based on the Weighted Strengthened Local Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1670–1674. [Google Scholar] [CrossRef]
  36. Chen, C.L.P.; Li, H.; Wei, Y.T.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581. [Google Scholar] [CrossRef]
  37. Wang, X.T.; Lu, R.T.; Bi, H.X.; Li, Y.H. An Infrared Small Target Detection Method Based on Attention Mechanism. Sensors 2023, 23, 8608. [Google Scholar] [CrossRef]
  38. Shi, Y.F.; Wei, Y.T.; Yao, H.; Pan, D.H.; Xiao, G.R. High-Boost-Based Multiscale Local Contrast Measure for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2018, 15, 33–37. [Google Scholar] [CrossRef]
  39. Zhu, H.; Liu, S.M.; Deng, L.Z.; Li, Y.S.; Xiao, F. Infrared Small Target Detection via Low-Rank Tensor Completion With Top-Hat Regularization. IEEE Trans. Geosci. Remote Sens. 2020, 58, 1004–1016. [Google Scholar] [CrossRef]
  40. He, Y.J.; Li, M.; Zhang, J.L.; Yao, J.P. Infrared Target Tracking Based on Robust Low-Rank Sparse Learning. IEEE Geosci. Remote Sens. Lett. 2016, 13, 232–236. [Google Scholar] [CrossRef]
  41. Zhang, L.D.; Peng, L.B.; Zhang, T.F.; Cao, S.Y.; Peng, Z.M. Infrared Small Target Detection via Non-Convex Rank Approximation Minimization Joint l2.1 Norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  42. Dai, Y.M.; Wu, Y.Q. Reweighted Infrared Patch-Tensor Model With Both Nonlocal and Local Priors for Single-Frame Small Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767. [Google Scholar] [CrossRef]
  43. Zhang, T.F.; Wu, H.; Liu, Y.H.; Peng, L.B.; Yang, C.P.; Peng, Z.M. Infrared Small Target Detection Based on Non-Convex Optimization with Lp-Norm Constraint. Remote Sens. 2019, 11, 559. [Google Scholar] [CrossRef]
  44. Zhang, T.F.; Peng, Z.M.; Wu, H.; He, Y.M.; Li, C.H.; Yang, C.P. Infrared small target detection via self-regularized weighted sparse model. Neurocomputing 2021, 420, 124–148. [Google Scholar] [CrossRef]
  45. Dai, Y.M.; Wu, Y.Q.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  46. Zhao, J.M.; Yu, C.; Shi, Z.L.; Liu, Y.P.; Zhang, Y.D. Gradient-Guided Learning Network for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  47. Zhang, T.F.; Li, L.; Cao, S.Y.; Pu, T.; Peng, Z.M. Attention-Guided Pyramid Context Networks for Detecting Infrared Small Target Under Complex Background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261. [Google Scholar] [CrossRef]
  48. Chung, W.Y.; Lee, I.H.; Park, C.G. Lightweight Infrared Small Target Detection Network Using Full-Scale Skip Connection U-Net. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  49. Huang, H.M.; Lin, L.F.; Tong, R.F.; Hu, H.J.; Zhang, Q.W.; Iwamoto, Y.; Han, X.H.; Chen, Y.W.; Wu, J. UNET 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  50. Wu, S.L.; Xiao, C.; Wang, L.G.; Wang, Y.Q.; Yang, J.A.; An, W. RepISD-Net: Learning Efficient Infrared Small-Target Detection Network via Structural Re-Parameterization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  51. Wang, Z.; Zang, T.; Fu, Z.L.; Yang, H.; Du, W.L. RLPGB-Net: Reinforcement Learning of Feature Fusion and Global Context Boundary Attention for Infrared Dim Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  52. Yu, W.; Wang, X. MambaOut: Do We Really Need Mamba for Vision? In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025.
  53. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  54. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. Available online: https://arxiv.org/pdf/2312.00752 (accessed on 4 July 2024).
  55. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. Available online: https://arxiv.org/pdf/2111.00396 (accessed on 4 July 2024).
  56. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 2021, 34, 572–585. [Google Scholar]
  57. Dao, T.; Gu, A. Transformers Are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024; pp. 10041–10071. [Google Scholar]
  58. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. Int. J. Comput. Vis. 2014, 106, 1032–1040. [Google Scholar]
  59. Liu, R.; Gao, J.; Zhang, J.; Meng, D.; Lin, Z. Investigating Bi-Level Optimization for Learning and Vision From a Unified Perspective: A Survey and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 10045–10067. [Google Scholar] [CrossRef]
  60. Liu, R.; Ma, L.; Yuan, X.; Zeng, S.; Zhang, J. Task-Oriented Convex Bilevel Optimization with Latent Feasibility. IEEE Trans. Image Process. 2022, 31, 1190–1203. [Google Scholar] [CrossRef]
  61. Ochs, P.; Ranftl, R.; Brox, T.; Pock, T. Bilevel optimization with nonsmooth lower level problems. In Scale Space and Variational Methods in Computer Vision; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9087, pp. 654–665. [Google Scholar]
  62. Chen, Y.; Liu, C.; Huang, W.; Cheng, S.; Arcucci, R.; Xiong, Z. Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation. arXiv 2023, arXiv:2306.04811. Available online: https://arxiv.org/abs/2306.04811 (accessed on 25 October 2025).
  63. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV 2021), Waikoloa, HI, USA, 3–8 January 2021. [Google Scholar]
  64. Zhang, S.F.; Liu, M.Y.; Han, Z.X. Generation Method of High Dynamic Range Image from a Single Low Dynamic Range Image Based on Retinex Enhancement. J. Comput.-Aided Des. Comput. Graph. 2018, 30, 1016–1022. [Google Scholar] [CrossRef]
  65. Chen, X.Y.; Liu, Y.H.; Zhang, Z.W.; Qiao, Y.; Dong, C. HDRUNet: Single Image HDR Reconstruction with Denoising and Dequantization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 354–363. [Google Scholar]
  66. Jiang, C.; Shao, H. Fast 3D Reconstruction of UAV Images Based on Neural Radiance Field. Appl. Sci. 2023, 13, 10174. [Google Scholar] [CrossRef]
  67. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  68. Du, S.Y.; Wang, K.W.; Cao, Z.G. BPR-Net: Balancing Precision and Recall for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  69. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101. Available online: https://arxiv.org/abs/1711.05101 (accessed on 13 April 2024). [CrossRef]
  70. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
  71. Paszke, A. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 8026–8037. [Google Scholar]
Figure 1. Overview of our proposed method.
Figure 2. MFEF-Net network structure. (a) Details of MFEF-Net. (b) Details of Res_CSAM block.
Figure 3. AVM-UNet network structure. (a) Details of AVM-UNet. (b) Details of VSS block.
Figure 4. Comparison between HDR infrared images generated by different methods and the ground-truth HDR infrared images on the FLIR dataset. Retinex-based refers to the method in reference [64], and HDRUNet refers to the method in reference [65].
Figure 5. HDR images generated by the three methods and their corresponding pixel distributions. Retinex-based refers to the method in reference [64], and HDRUNet refers to the method in reference [65].
Figure 6. Grayscale value distributions of some collected real raw infrared images and of NUDT-SIRST-16bit. (a) Maximum grayscale value distribution. (b) Minimum grayscale value distribution. (c) Dynamic range distribution.
Figure 7. The visual results of test images enhanced using different methods.
Figure 8. ROC performance of different algorithms on (a) NUAA-SIRST, (b) SIRST, respectively.
Figure 9. Various detection methods yield distinct qualitative outcomes on the NUDT-SIRST dataset. To enhance clarity, the designated region is magnified in the upper-left corner. Identified target regions, false alarms, and missed detections are highlighted by blue, red, and green circles, respectively.
Figure 10. Various detection methods yield distinct qualitative outcomes on the SIRST dataset. To enhance clarity, the designated region is magnified in the upper-left corner. Identified target regions, false alarms, and missed detections are highlighted by blue, red, and green circles, respectively.
Figure 11. The ROC performance of AVM-UNet with enhancement results from different methods as input.
Figure 12. Ablation experiment results on the test set. (a) PSNR scores of different structures. (b) SSIM scores of different structures.
Figure 13. ROC performance of different AVM-UNet network structures.
Figure 14. Comparison of output feature maps at specific stages for different models.
Table 1. PSNR and SSIM values of various methods on the test set; larger values indicate better performance. The best results are in red, and the second best results are in blue.
Method         Image1          Image2          Image3          Image4          Image5          Image6
               PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM
HE             31.546  0.816   32.417  0.727   32.093  0.757   31.571  0.540   32.487  0.552   33.309  0.818
CLAHE          32.002  0.490   32.098  0.775   32.136  0.487   31.784  0.565   32.138  0.494   32.764  0.636
Retinex        32.679  0.919   32.032  0.851   32.891  0.768   32.165  0.750   32.150  0.882   32.469  0.805
Gamma          31.934  0.842   32.036  0.441   31.391  0.790   31.622  0.720   33.343  0.894   32.668  0.328
Log            32.684  0.926   31.694  0.507   31.676  0.930   31.819  0.760   31.912  0.909   31.881  0.272
MFEF-Net(DT)   31.282  0.986   31.604  0.973   31.687  0.940   31.087  0.935   31.671  0.957   35.411  0.918
MFEF-Net(CT)   31.291  0.978   32.078  0.970   31.616  0.939   31.544  0.945   31.657  0.954   33.470  0.952
Table 2. Comparison of detection performance (mIoU, Pd, and Fa (×10⁵)) and model efficiency (Params (M), FLOPs (G), and FPS) across various methods on the NUDT-SIRST and SIRST datasets. The best results are in red, and the second best results are in blue. AVM-UNet(CT) refers to the cooperative training and optimization of MFEF-Net with AVM-UNet and the use of both subnetworks for end-to-end inference; its FLOPs and Params are the totals of the two subnetworks combined.
                     IAA-Net   RepISD-Net   DNA-Net(Res18)   UIU-Net   AMFU-Net   AVM-UNet(DT)   AVM-UNet(CT)
NUDT-SIRST   mIoU    0.8821    0.8795       0.8428           0.8972    0.8346     0.9102         0.9326
             Pd      0.9591    0.9438       0.9178           0.9223    0.8979     0.9579         0.9766
             Fa      8.4898    8.8168       11.7882          7.7097    12.6716    6.4878         4.7784
SIRST        mIoU    0.6989    0.6869       0.7424           0.7175    0.7661     0.7867         0.8225
             Pd      0.8312    0.8998       0.8240           0.9090    0.8502     0.8922         0.9026
             Fa      12.8292   13.8958      17.5253          11.1255   15.4810    11.0663        7.9553
Image size   FLOPs   438.35    7.09         14.28            54.50     5.95       1.65           43.19
256 × 256    Params  18.250    0.310        4.697            50.545    0.473      6.353          6.988
             FPS     48.87     332.98       127.38           91.01     200.24     191.75         109.96
Table 3. Detection performance (mIoU, Pd, and Fa) of AVM-UNet with the enhancement results of different methods as inputs. The best results are in red.
Method         mIoU     Pd       Fa (×10⁵)
8-bit GT       0.9102   0.9579   6.4878
HE             0.2632   0.6967   57.7137
CLAHE          0.2739   0.7428   55.6372
Retinex        0.4446   0.8109   42.9712
Gamma          0.3978   0.7511   48.1053
Log            0.3695   0.7254   50.8358
MFEF-Net(DT)   0.9271   0.9716   5.1914
MFEF-Net(CT)   0.9326   0.9766   4.7784
Table 4. Detection performance (mIoU, Pd, and Fa) on AVM-UNet using the images enhanced by MFEF-Net with different network structures as input. The best results are in red.
MFEF-Net Model (DT)   Params (M)   mIoU     Pd       Fa (×10⁵)
one branch            0.188        0.4533   0.8233   42.0075
two branch            0.337        0.8488   0.9249   11.2605
three branch          0.486        0.9047   0.9621   6.8550
w/o Res_CSAM          0.335        0.8781   0.9431   8.9258
MFEF-Net              0.635        0.9271   0.9716   5.1914
Table 5. The detection performance (mIoU, Pd, and Fa (×10⁵)) and model efficiency (Params (M) and FPS) of different AVM-UNet network structures. The best results are in red.
AVM-UNet Model (DT)   Params (M)   mIoU     Pd       Fa (×10⁵)   FPS
only Input 3          5.628        0.7239   0.8219   22.3814     217.27
only Input 2          1.420        0.7542   0.8584   19.1915     219.02
only Input 1          0.3616       0.8156   0.8921   14.1861     222.43
Input 2 + Input 3     6.349        0.8679   0.9381   9.7174      207.73
Input 1 + Input 3     6.339        0.8780   0.9457   8.9086      213.64
Input 1 + Input 2     1.597        0.8973   0.9483   7.4860      212.18
AVM-UNet              6.353        0.9102   0.9579   6.4878      191.75
Table 6. Inference comparison on the NVIDIA Jetson Orin Nano.
Model        Params (M)   FLOPs (G)   Inference Latency (ms)   FPS      Power (W)   Memory (MB)
RepISD-Net   0.310        7.09        7.4                      135.14   10          371
AMFU-Net     0.473        5.95        15.6                     64.10    7           374
AVM-UNet     6.353        1.65        7.9                      126.58   8           398
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
