Article

Thermodynamics-Inspired Multi-Feature Network for Infrared Small Target Detection

State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4716; https://doi.org/10.3390/rs15194716
Submission received: 25 July 2023 / Revised: 30 August 2023 / Accepted: 31 August 2023 / Published: 26 September 2023

Abstract

Infrared small target detection (IRSTD) is widely used in many fields, such as detection and guidance systems, and is of great research importance. However, small targets in infrared images are typically tiny, blurry, feature-poor, and easily overwhelmed by noisy backgrounds, which poses a significant challenge for IRSTD. In this paper, we propose a thermodynamics-inspired multi-feature network (TMNet) for the IRSTD task, which extracts richer and more essential semantic features of infrared targets through cross-layer and multi-scale feature fusion, assisted by a thermodynamics-inspired super-resolution branch. Specifically, TMNet consists of an attention-directed feature cross-aggregation encoder (AFCE), a U-Net backbone decoder, and a thermodynamic super-resolution branch (TSB). In the shrinkage path, the original encoder structure is reconstructed as the AFCE, which contains two depth-weighted multi-scale attention (DMA) modules and a cross-layer feature fusion (CFF) module. The DMA and CFF modules achieve self-feature-guided multi-scale feature fusion and cross-layer feature interaction by utilizing semantic features from different stages of the encoding process. In thermodynamics, differences in heat between particles cause heat to transfer between objects, which inspired us to draw an analogy between this process and the feature extraction process in which the network's attention gradually focuses on an infrared target under the constraints of the loss function. On the expansion path, the TSB module incorporates the Hamming equation of thermodynamics to mine infrared detail features through heat transfer-inspired high-resolution feature representations while assisting the low-resolution branch to learn high-resolution features. We conduct extensive experiments on the publicly available NUAA-SIRST dataset and find that the proposed TMNet exhibits excellent detection performance in both pixel-level and object-level metrics. This finding provides a relatively dependable guideline for formulating network designs aimed at IRSTD.

Graphical Abstract

1. Introduction

The identification of small targets in infrared images stands as a pivotal technology within the realm of target recognition. Unlike visible imaging mechanisms, infrared imaging can penetrate obstacles and capture more target information in low-light conditions. Accordingly, IRSTD assumes an indispensable role in various domains [1,2,3,4,5], such as detection and guidance systems, early warning systems, and maritime rescue systems. In general, the distance between the object identified by the IRSTD task and the infrared sensor is very long, and the target often occupies merely a few pixels and lacks comprehensive details such as form and texture. This characteristic has been a thorny hurdle in the domain of IRSTD. Furthermore, the energy of infrared radiation decays with increasing imaging distance, greatly reducing the contrast between target feature information and background noise. Therefore, deeply mining the detailed features of infrared targets and minimizing the accuracy loss in image semantic segmentation is crucial to solving this challenge.
To address the challenges of IRSTD, traditional methods [6,7,8,9,10,11] typically treat it as a problem of image filtering and target enhancement, and many innovative methods have been proposed along different directions, such as background spatial consistency [12,13] and target saliency [14,15,16]. Among filtering methods built on background spatial consistency, the max-median/max-mean [12] methods compute filtering outcomes along various directions and select the maximum value among them, thereby suppressing edge information. In contrast, top-hat [17] traverses the entire image with a specifically shaped filtering window and performs erosion and dilation operations on each pixel to highlight the target. Filtering methods can only suppress relatively simple background clutter, and their performance becomes highly unstable in the presence of complex noise interference. Target saliency methods, in turn, are inspired by the traits of the human visual system (HVS) and assume that infrared targets are the most salient objects in the scene. For example, the spectral residual method [14] focuses on variations in the image background and eliminates the background by extracting the most prominent parts via the spectral residual in the image’s spectral domain. Local contrast methods, such as the tri-layer local contrast measure (TLLCM) [16] and the weighted strengthened local contrast measure (WSLCM) [15], essentially extract targets based on the dissimilarity between the current position and its nearby local neighborhood. While target saliency methods perform well in certain scenarios, they perform poorly in low-contrast environments.
The advancement of deep learning has led to an increasing number of proposed methods that aim to improve the accuracy of infrared target detection and overcome performance instability in complex environments. For instance, Miss Detection vs. False Alarm (MDvsFA) [18] divides IRSTD into two sub-tasks, each handled independently by one of two generative adversarial network (GAN) [19] models. To achieve the best detection accuracy, each GAN model focuses on reducing either miss detection (MD) or false alarm (FA), thereby reducing both MD and FA in a multi-objective manner. Networks based on the encoder–decoder [20] architecture have achieved impressive results in the IRSTD domain and have been widely applied. For instance, Dai et al. propose adding an asymmetric contextual modulation (ACM) [21] module on top of the encoder–decoder structure in the neck region. This module fuses low-level semantic information with high-level semantics, thereby avoiding the loss of feature information from infrared images during the encoding process. In addition to proposing a context semantic attention module suitable for infrared tasks, Dai et al. introduce an attentional local contrast network (ALCNet) [22], which combines traditional local contrast measurement methods with feature learning methods through a designed feature mapping cyclic displacement scheme. By leveraging a bottom-up local attention modulation module, ALCNet embeds low-level semantics into high-level semantics. Network attention mechanisms designed for the IRSTD task have yielded promising results. Zhang et al. [23] develop the Runge–Kutta transformer (RKformer) [24] method, which employs concepts of the Runge–Kutta equation [25,26] to design a parallel convolution and transformer [27] approach, replacing the conventional encoding process. Furthermore, a cross-level correlation network (FC3-Net) [28] proposed by Zhang et al. utilizes a fine-detail-guided multi-level feature compensation (F-MFC) module and a cross-level feature correlation (CFC) module to compensate for the feature loss resulting from variations in feature map size and to further amplify the network’s capacity to locate and represent the shape of the target. Song et al. propose the amorphous variable inter-located network (AVILNet) [29], built on GridNet [30], which achieves a time-efficient, optimally structured network through a multi-scale attention integration module and a unique fusion strategy. YOLOSR-IST [31], proposed by Li et al., effectively alleviates the missed detection and false detection problems of data-driven detection-based methods through super-resolution methods and transformer-based feature blocks. The dual-domain prior-driven deep network (DPDNet) [32] proposed by Hao et al. includes three modules: a sparse feature driver module, a high-frequency feature driver module, and a primary detection module, which jointly guide the network to efficiently learn infrared small target features. Furthermore, the asymmetric patch attention fusion network (APAFNet) [33] proposed by Wang et al. obtains more comprehensive semantic details by modulating high-level and low-level semantic information in different scenarios through asymmetric patch attention fusion (APAF) modules and expanding context blocks.
However, existing research mainly focuses on feature learning from raw-resolution images and rarely exploits infrared physical features and phenomena for feature extraction and information interaction, which makes the network prone to losing target feature details and thereby degrades detection accuracy. In thermodynamics, particles with different energies move over time within the same closed environment, and heat spontaneously transfers from particles with higher heat to those with lower heat. In the study of the IRSTD task, the process by which an IRSTD algorithm learns an infrared target can be analogously viewed as one in which infrared features spontaneously pass through the network under the constraints of the loss function until the network’s attention is focused entirely on the small target region. Therefore, we link the motion of infrared features in the neural network to the phenomenon of heat transfer and further attempt to use thermodynamic equations to model segmentation results that lie closer to the real target. In addition, the similarity between the gradual conversion of an image from low resolution to high resolution by super-resolution methods and the gradual focusing of the network’s attention onto the infrared target region in IRSTD inspired us to combine the thermodynamic method with super-resolution for the IRSTD task.
For this reason, we propose a thermodynamics-inspired multi-feature network (TMNet), which takes the backbone network of U-Net as its main structure and consists of an attention-directed feature cross-aggregation encoder (AFCE), a U-Net backbone decoder, and a thermodynamic super-resolution branch (TSB). In the design of TMNet, we creatively propose to optimize the whole link of the network, improving the network structure in both top-down and bottom-up aspects. In the top-down path, we reconstruct the original encoder and design the AFCE, which comprises a series of cascaded regular residual blocks, a cross-layer feature fusion (CFF) module, and two depth-weighted multi-scale attention (DMA) modules. At each level of the encoder path, the DMA module receives the output of the corresponding residual block, performs a weighted fusion of feature images through an attention mechanism that uses depth vectors as weights, and passes the result to the next-level residual block and the CFF module. Subsequently, the CFF module cross-fuses the feature images from the residual blocks at each level and passes them to the decoder. As a result, we effectively extract multi-scale semantic features and enable cross-layer semantic interaction in the encoding–decoding structure, thereby preserving the rich semantic features of infrared images. In the bottom-up path, we add the TSB module, which introduces a thermodynamics-inspired cooperative mechanism on super-resolution images to assist the semantic segmentation operation. The TSB module combines the Hamming equation to extract super-resolution features, enabling each layer’s feature map in the decoding stage to be assisted by a corresponding super-resolution feature map during learning. By adding a super-resolution branch loss function, the branch becomes trainable, better capturing high-resolution semantic features while preserving low-resolution features. To assess the efficacy of the proposed TMNet, we perform thorough experiments on the publicly available NUAA-SIRST dataset and conclude that TMNet outperforms state-of-the-art (SOTA) methods.
Overall, the contributions of this paper are mainly in three aspects:
  • We introduce an innovative IRSTD model, TMNet, which leverages a novel super-resolution branch for assisted feature learning and explores and fuses multi-scale features through full-link connections, demonstrating outstanding performance on the NUAA-SIRST dataset.
  • We reconstruct the encoder and propose a new AFCE structure, which utilizes generated depth vectors to guide multi-scale feature image fusion, enabling comprehensive exploration of spatial detail features.
  • We introduce a thermodynamics-inspired cooperative mechanism by creating the TSB, which combines the thermodynamic Hamming equation with super-resolution to enhance the high-resolution representation under low-resolution input.

2. Related Work

2.1. Infrared Small Target Detection

Existing IRSTD methods can be divided into traditional methods and deep learning-based methods. Traditional methods rely on non-learning or heuristic image processing techniques, approaching the IRSTD problem as an image filtering and target enhancement problem. They include filter-based methods such as the top-hat filter (Top-hat) [17], the max-median/max-mean filter [12], the two-dimensional least-mean-square (TDLMS) filter [34], and the two-dimensional variational mode decomposition (TDVMD) [35] method, as well as target saliency-based methods such as the spectral residual method [14], the weighted strengthened local contrast measure (WSLCM) [15], the tri-layer local contrast measure (TLLCM) [16], the infrared patch-image (IPI) model [36], the partial sum of the tensor nuclear norm (PSTNN) [6], and non-convex rank approximation minimization (NRAM) [7]. Nevertheless, traditional methods are often limited to specific and simple application scenarios; when dealing with interference from clutter and noise in complex backgrounds, their performance fluctuates significantly, leading to detection failures as they fail to accurately preserve infrared target features.
In order to improve robustness and adapt to most complex environments, deep learning models based on convolutional neural networks (CNN) [37,38] have gradually shown excellent performance in IRSTD tasks. Zhao et al. propose a lightweight CNN network called TBC-Net [39], which effectively balances infrared image targets and background through a joint loss function, a semantic modulation module, and a target extraction module. Dai et al. develop ALCNet [22], which combines traditional local contrast methods with feature learning, pairing easily lost shallow features with deeper features through the cyclic transfer of feature paths. Additionally, they introduce an ACM [21] module that extracts infrared target information from low-level semantic information using attention mechanisms and integrates it with high-level semantics to obtain more effective infrared semantic features in real time. Li et al. present a dense nested attention network (DNA-Net) [40] by building upon the coding–decoding structure and incorporating densely nested multi-directional feature interaction modules and cascaded feature attention mechanisms. By repeatedly fusing and reusing features from different stages, the network effectively harnesses infrared information. Wang et al. design the miss detection vs. false alarm generative adversarial (MDvsFA GAN) [18] model, which utilizes adversarial learning to suppress both miss detections and false alarms. To better utilize manually crafted features in the presence of complex background interference, Zhang et al. [28] propose an F-MFC module and a cross-level feature correlation (CFC) module, which effectively restore the edge information of the target and reduce the loss of infrared features in the network, thereby preserving more information. Furthermore, they develop a curvature half-level fusion network (CHFNet) [41], a model that extracts image edge information based on the curvature feature and achieves more accurate target extraction by fusing and filtering features from each layer. While current CNN-based models for IRSTD demonstrate excellent performance, most of them focus on incorporating attention mechanisms by enhancing the encoding structure. However, they often overlook the overall architecture of the network, resulting in limited richness and effectiveness of feature extraction.

2.2. Cross-Layer Feature Fusion

The semantic segmentation network paradigm based on the encoding–decoding structure has shown excellent performance in segmentation tasks, but its high-level semantic features and low-level semantic features are distributed at the two ends of the network, so popular detection networks [42,43,44] aggregate multiple layers of features [45] through cross-layer feature fusion to improve segmentation performance. Common approaches for cross-layer feature fusion include skip connections as in U-Net [20], Deeplabv3+ [46], and the feature pyramid network (FPN) [47], gate-based fusion methods such as the gated full fusion network (GFFNet) [48], and the alignment of features across different layers using semantic flow, as in SFNet [49]. However, long-distance feature propagation pathways still result in semantic feature loss and feature mismatch. To guide the long-distance information flow with semantic features from both ends of the encoder–decoder, many networks employ attention mechanisms, such as the dual attention network (DANet) [50], object context network (OCNet) [51], criss-cross network (CCNet) [52], expectation-maximization attention network (EMANet) [53], and squeeze-and-attention network (SANet) [54]. In the field of IRSTD, Dai et al. design feature fusion networks, ACMNet [21] and ALCNet [22], based on the interaction between low-level and high-level semantics. Additionally, Zhang et al. introduce cross-layer feature fusion networks such as CHFNet [41] and FC3-Net [28]. However, due to the low signal-to-noise ratio of infrared target features and the presence of background clutter, relying solely on a single scale or a single-side network structure makes it challenging to completely overcome the loss of features for small targets. Consequently, we introduce a cross-layer and multi-scale interactive semantic feature fusion mechanism.

2.3. Image Super-Resolution

Image super-resolution is the process of restoring an image from low resolution to high resolution. Deep learning-based image super-resolution methods have showcased excellent performance in various tasks in recent years, and existing mainstream super-resolution methods can be classified into single upsampling methods [55,56,57,58] and multiple upsampling methods [59,60,61] according to the number of upsampling operations. Wang et al. propose a dual super-resolution learning (DSRL) [62] method, which introduces image resolution into segmentation tasks for the first time. DSRL applies super-resolution methods to obtain higher-resolution images to assist the segmentation network, resulting in more accurate segmentation results. Recently, ordinary differential equation (ODE) methods [25,26,63] have shown great potential in the design of neural networks in deep learning. For example, He et al. [25] design a novel super-resolution network based on the forward Euler method. However, in IRSTD tasks, one-off super-resolution image-assisted segmentation makes it difficult to recover the detailed semantic features of small targets lost during encoding. Inspired by the tendency of thermal particles to gradually transition from an unstable high-temperature state to a more stable low-temperature state, we draw a parallel between this phenomenon and the diffusion process of infrared features in the network. Therefore, we design a novel super-resolution branch based on the Hamming equation of thermodynamics to assist low-resolution features in learning high-resolution features, which in turn assists the segmentation branch in obtaining more accurate semantic features.

3. Method

3.1. Network Overview

The overall architecture of TMNet is shown in Figure 1. It comprises the attention-directed feature cross-aggregation encoder (AFCE), a U-Net backbone decoder, and the thermodynamic super-resolution branch (TSB). Considering the superiority of U-Net in infrared small target detection and semantic segmentation accuracy, our network takes the backbone network of U-Net as its main structure and improves the network structure in both top-down and bottom-up aspects over the whole link.
On the one hand, we reconstruct the original encoder in the top-down path and design the AFCE, which comprises a series of cascaded regular residual blocks, a CFF module, and two DMA modules. Each DMA module receives the input from the corresponding residual block, uses depth cue information as the weight of the attention mechanism to guide the fusion of multi-scale features, generates the fused feature map, and passes it to the next-level residual block and the CFF module. The CFF module then fully guides the exchange and fusion of features among the residual blocks at each level to achieve higher-accuracy feature extraction. This process can be defined as follows:
$F_1 = H_{\mathrm{DMA}}(C_1),$
$F_2 = H_{\mathrm{DMA}}(C_2),$
$(N_1, N_2) = H_{\mathrm{CFF}}(F_1, F_2),$
where $C_1$ and $C_2$ signify the output feature maps of the residual blocks at distinct encoding stages, which are about to be fed into the DMA modules; $F_1$ and $F_2$ stand for the output feature maps of the DMA modules, which are input to the CFF module; and $N_1$ and $N_2$ symbolize the output feature maps of the CFF module, which are fed into the residual blocks of the corresponding decoding stages. $H_{\mathrm{DMA}}$ and $H_{\mathrm{CFF}}$ stand for the functions of the DMA module and CFF module, respectively.
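For concreteness, the dataflow above can be sketched in PyTorch; this is a minimal illustration in which the residual stages and the DMA and CFF submodules are placeholders for the components detailed in Section 3.2, and all configurations are assumptions rather than the released implementation.

```python
import torch.nn as nn

class AFCE(nn.Module):
    """Minimal sketch of the AFCE dataflow (equations above); the submodules
    passed in are placeholders for the components of Section 3.2."""

    def __init__(self, res1, res2, dma1, dma2, cff):
        super().__init__()
        self.res1, self.res2 = res1, res2  # cascaded regular residual blocks
        self.dma1, self.dma2 = dma1, dma2  # depth-weighted multi-scale attention
        self.cff = cff                     # cross-layer feature fusion

    def forward(self, x):
        c1 = self.res1(x)          # C1: first-stage encoder features
        f1 = self.dma1(c1)         # F1 = H_DMA(C1)
        c2 = self.res2(f1)         # C2: second-stage encoder features
        f2 = self.dma2(c2)         # F2 = H_DMA(C2)
        n1, n2 = self.cff(f1, f2)  # (N1, N2) = H_CFF(F1, F2)
        return n1, n2              # passed to the decoder stages
```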
On the other hand, inspired by thermodynamics, we add the proposed TSB to the bottom-up path. It cooperates with the decoder in a two-branch super-resolution framework to assist the semantic segmentation operation.

3.2. Attention-Directed Feature Cross-Aggregation Encoder (AFCE)

3.2.1. Depth-Weighted Multi-Scale Attention Module (DMA)

In conventional encoder structures, the presence of downsampling and pooling layers often causes irreversible loss of detailed features of the image, leaving no way for detailed features to propagate to the deeper layers of the network. Our proposed DMA module can solve this problem well. In this module, the depth information of the upper-level residual block is used as the weight of the attention mechanism to guide its fusion of feature information at multiple scales, generating a fused feature with rich saliency cues to be passed to the next-level residual block.
The detailed structure of the DMA module is shown in Figure 2. To obtain the depth vector, taking feature $C_1$ as an example, we apply a global average pooling layer and a convolution layer to $C_1$ and then leverage the softmax function to derive $C_{depth}$, which guides the multi-scale features. The formulation is presented as follows:
$C_{depth} = S(\mathrm{Conv}(\mathrm{AvgPooling}(C_1))),$
where $C_{depth}$ denotes the resulting depth vector and $S(\cdot)$ denotes the softmax function.
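A minimal sketch of this computation, assuming $C_1$ has shape (B, C, H, W) and a 1 × 1 convolution maps the C channels to the M = 6 branch weights used below (e.g., `conv = nn.Conv2d(C, 6, kernel_size=1)`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depth_vector(c1: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
    """C_depth = S(Conv(AvgPooling(C1))); returns shape (B, M, 1, 1)."""
    pooled = F.adaptive_avg_pool2d(c1, 1)  # global average pooling -> (B, C, 1, 1)
    logits = conv(pooled)                  # 1x1 conv: C channels -> M scale scores
    return torch.softmax(logits, dim=1)    # normalize across the M scales
```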
To explore the contextual features of the image at multiple scales, we apply pooling layers and multiple parallel convolution layers with different dilation rates and kernel sizes to $C_1$, generating six multi-scale features $f_m$ ($m = 1, 2, \ldots, 6$) with identical resolution yet distinct contextual information. This process and its detailed parameters can be expressed as follows:
$f_1 = \delta(B(\mathrm{Conv}_{1\times1}(C))),$
$f_2 = \delta(B(\mathrm{Conv}_{3\times3}(C))),$
$f_3 = \delta(B(\mathrm{AConv}_{d=7}(C))),$
$f_4 = \delta(B(\mathrm{AConv}_{d=5}(C))),$
$f_5 = \delta(B(\mathrm{AConv}_{d=3}(C))),$
$f_6 = \mathrm{Upsample}(\delta(B(\mathrm{AConv}_{1\times1}(\mathrm{APooling}(\mathrm{APooling}(C)))))),$
where $\mathrm{Conv}(\cdot)$, $\mathrm{AConv}(\cdot)$, and $\mathrm{APooling}(\cdot)$ denote the convolution, atrous convolution, and atrous spatial pyramid pooling layers with different parameters, respectively, and $\delta$ and $B$ denote the rectified linear unit (ReLU) and batch normalization (BN), respectively. We apply atrous convolution instead of strided convolution to extract image features, which reduces information loss and enlarges the receptive field under the same computational conditions, explicitly maintaining a high-resolution depth feature representation.
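The six branches can be sketched as follows; the kernel sizes and dilation rates come from the equations above, while the channel width and the use of plain average pooling in place of the APooling layers are assumptions:

```python
import torch.nn as nn

def conv_branch(in_ch, out_ch, k=1, dilation=1):
    """Conv -> BN -> ReLU branch; padding keeps the spatial resolution."""
    pad = dilation * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

C = 64  # assumed channel width
branches = nn.ModuleList([
    conv_branch(C, C, k=1),              # f1: 1x1 conv
    conv_branch(C, C, k=3),              # f2: 3x3 conv
    conv_branch(C, C, k=3, dilation=7),  # f3: atrous conv, d = 7
    conv_branch(C, C, k=3, dilation=5),  # f4: atrous conv, d = 5
    conv_branch(C, C, k=3, dilation=3),  # f5: atrous conv, d = 3
])
f6_branch = nn.Sequential(               # f6: pooled 1x1 branch, then upsample
    nn.AvgPool2d(2), nn.AvgPool2d(2),    # stand-ins for the two APooling layers
    conv_branch(C, C, k=1),
    nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
)
```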
Next, the depth vector guides the fusion of these multi-scale features in the form of weights to generate a new feature image. This operation can be defined as follows:
$F_1 = \sum_{m=1}^{M} C_{depth}^{m} \times f_m,$
Overall, we explore contextual feature images at multiple scales and employ depth cues with rich spatial information to guide their fusion, which significantly weakens the refinement loss caused by traditional encoders and highlights image detail features.
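Continuing the sketch above, the depth-weighted fusion is then a per-scale weighted sum over the M = 6 branch outputs:

```python
import torch

def dma_fuse(features, c_depth):
    """F = sum_m C_depth[m] * f_m over the M = 6 equally sized maps.

    features: list [f1, ..., f6] of maps with shape (B, C, H, W)
    c_depth:  depth vector with shape (B, M, 1, 1)
    """
    fused = torch.zeros_like(features[0])
    for m, f_m in enumerate(features):
        fused = fused + c_depth[:, m:m + 1] * f_m  # broadcast scalar weight
    return fused
```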

3.2.2. Cross-Layer Feature Fusion Module (CFF)

In most semantic segmentation methods built on U-Net, the design of new connection schemes between residual blocks is far from negligible. In our approach, we improve the encoder and decoder along the original path, which also allows a large amount of detailed feature information to be mined. To leverage these features to the fullest extent, we redesign the skip connections by introducing the CFF module. It aggregates the low-level and high-level details from every level of the encoder to compensate the feature images of the corresponding decoder levels, fully exploring full-scale information.
The detailed workflow of the CFF module is depicted in Figure 3. In the first step, the feature map $F_1$ from the first-level residual block of the encoder passes through a max pooling layer and a convolution layer with a 3 × 3 kernel, followed by BN and ReLU layers, to form a feature map with the same resolution as the feature map from the second-level residual block. Subsequently, this feature map is incorporated into the second-stage residual linking path, fused and superimposed with the feature map $F_2$ from the second-stage residual block, and linked to the second-stage residual block layer of the decoder after a deconvolution layer is applied. Similarly, in the second step, the feature map $F_2$ from the second-level residual block is processed by a series of similar operations, except that the max pooling layer and convolution layer of the first step are replaced by a deconvolution layer, to form a feature map with the same resolution as the feature map $F_1$ from the first-level residual block, which is then added to the first-level residual linking path. This process can be formulated as follows:
$N_2 = \delta(B(\mathrm{Conv}(\mathrm{MaxPooling}(F_1))) + F_2),$
$N_1 = \delta(B(\mathrm{Deconv}(F_2)) + F_1),$
This innovative linking approach allows full-scale semantic information to be fully mined and exploited to encompass fine-grained details and coarse-grained semantics comprehensively.
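A minimal sketch of the two fusion equations, assuming a 2× resolution gap between $F_1$ and $F_2$ and illustrative channel counts:

```python
import torch.nn as nn
import torch.nn.functional as F

class CFF(nn.Module):
    """Sketch of cross-layer feature fusion (equations above)."""

    def __init__(self, ch1, ch2):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                    # downsample F1 to F2's size
        self.conv = nn.Conv2d(ch1, ch2, 3, padding=1)  # match F2's channel count
        self.bn_down = nn.BatchNorm2d(ch2)
        self.deconv = nn.ConvTranspose2d(ch2, ch1, 2, stride=2)  # upsample F2
        self.bn_up = nn.BatchNorm2d(ch1)

    def forward(self, f1, f2):
        n2 = F.relu(self.bn_down(self.conv(self.pool(f1))) + f2)  # N2
        n1 = F.relu(self.bn_up(self.deconv(f2)) + f1)             # N1
        return n1, n2
```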

3.3. Thermodynamic Super-Resolution Branch (TSB)

In many existing methods, the decoder can only upsample the low-resolution feature maps passed by the encoder to the same size as the input image for analysis, which may lead to the loss of high-resolution information details in the original image and limit the performance of the network structure. We consider the transmission process of infrared features in the network as the movement of thermal particles, extract the features with super-resolution, and further fuse the super-resolution features according to the Hamming equation in thermodynamics.
Therefore, we add the proposed thermodynamics-inspired cooperative super-resolution module on top of the original decoder structure to solve the above dilemma. We follow a two-branch design and subtly introduce a cooperative mechanism to maintain the high-resolution representation in the presence of low-resolution inputs. In the semantic segmentation branch, we enhance network performance and information utilization by upsampling the prediction masks during both training and testing, effectively utilizing valid label information. This approach outperforms the classical decoder structure, while the added upsampling module exhibits fewer parameters, leading to a substantial reduction in computational complexity. In the super-resolution branch, the fine-grained structural information in the input low-resolution feature maps is reconstructed and guided by feature affinity learning to bring additional high-resolution detail features to the decoder, enhancing the high-resolution representation of semantic segmentation. As depicted in Figure 4, the super-resolution auxiliary branch starts from the end of the encoder and its process can be represented as follows:
$X_{b\_SR} = \mathrm{Conv}(\mathrm{Upsample}(X_{bottom})),$
$X_{1\_SR} = \mathrm{Conv}(\mathrm{Upsample}(X_{b\_SR})) + \mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{Upsample}(X_{b\_SR})))),$
$X_{2\_SR} = \mathrm{Conv}(\mathrm{Upsample}(X_{b\_SR})) + \mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{Upsample}(X_{1\_SR})))),$
$X_{3\_SR} = \mathrm{Conv}(\mathrm{Upsample}(X_{b\_SR})) + \mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{Upsample}(X_{2\_SR})))),$
where $X_{bottom}$ represents the bottom-level feature passed from the encoder to the decoder, and $X_{1\_SR}$, $X_{2\_SR}$, and $X_{3\_SR}$ are the 2× super-resolution feature maps corresponding to the decoding stages $X_1$, $X_2$, and $X_3$. The super-resolution block (SR block) consists of a residual structure of upsampling and convolutional layers, which ensures the learnability of the auxiliary branch for infrared targets and helps eliminate the aliasing effects caused by upsampling. In this way, while the network learns target features, each decoding layer has a corresponding auxiliary super-resolution feature map that helps capture the detailed features lost in a single-branch low-resolution decoding structure.
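The SR block can be sketched as a residual upsampling unit. For simplicity, this sketch takes both the skip path and the refinement path from the immediately preceding stage's feature (the equations above draw the skip path from $X_{b\_SR}$ upsampled to the matching resolution); the channel width and interpolation mode are assumptions.

```python
import torch.nn as nn

class SRBlock(nn.Module):
    """Sketch of one super-resolution block of the TSB: a 2x upsampled
    skip path plus a Conv-ReLU-Conv residual refinement path."""

    def __init__(self, ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.skip = nn.Conv2d(ch, ch, 3, padding=1)
        self.refine = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x_prev):
        u = self.up(x_prev)                   # Upsample(previous SR feature)
        return self.skip(u) + self.refine(u)  # residual SR feature for this stage
```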
Numerical methods represented by the Hamming equation [64,65] are widely applied to heat transfer problems [66,67], as they discretize continuous thermodynamic problems into step-by-step calculations at discrete time points. To enhance the feature learning of the segmentation branch by leveraging the features from the super-resolution branch, we draw inspiration from the Hamming method of thermodynamics and perform feature fusion on the super-resolution features. We take the post-super-resolution infrared image features, process them with convolution and ReLU operations to simulate features at discrete times, and combine them with the Hamming equation to further approximate the real target. The Hamming method formula is given by
$y_{i+1} = \frac{1}{8}\left(y_i - y_{i-2}\right) + \frac{3h}{8}\left(f_{i+1} + 2f_i - f_{i-1}\right),$
where $f$ denotes the rate of change of the infrared feature in discrete time. To utilize the existing super-resolution infrared feature $X_{3\_SR}$ to simulate this rate of change, we define $f = Y - X$, where $X$ and $Y$ can be regarded as the input and output of a learning module consisting of two convolutional layers with a kernel size of 3 × 3 and a ReLU layer, such that $X_i = Y_{i-1}$. In addition, $h$ represents the step size; we set $h = 1$ to account for the stability and information loss of modeling super-resolution features, as well as to improve the readability of the module:
$Y_{i+1} = \frac{1}{8}\left(Y_i - Y_{i-2}\right) + \frac{3}{8}\left[\left(Y_{i+1} - X_{i+1}\right) + 2\left(Y_i - X_i\right) - \left(Y_{i-1} - X_{i-1}\right)\right].$
By further utilizing $Y_i$, $Y_{i-1}$, and $Y_{i-2}$, we can eliminate $X_{i+1}$, $X_i$, and $X_{i-1}$ to obtain
$Y_{i+1} = \frac{4}{5}Y_i - \frac{9}{5}Y_{i-1} + \frac{2}{5}Y_{i-2}.$
Let $\Delta Y_i$ represent the residual between $Y_i$ and $Y_{i-1}$, that is, $\Delta Y_i = Y_i - Y_{i-1}$:
$Y_{i+1} = -\frac{3}{5}Y_i + \frac{7}{5}\Delta Y_i - \frac{2}{5}\Delta Y_{i-1}.$
The above equation establishes the relationship between $Y_i$, $\Delta Y_i$, $\Delta Y_{i-1}$, and the final output $Y_{i+1}$; the output is subjected to a mean squared error loss against the input image $X$ to enhance the feature representation.
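The closed-form recurrence derived above is straightforward to apply to the cached super-resolution features; a minimal sketch, using the coefficients reconstructed above:

```python
import torch

def hamming_step(y_i: torch.Tensor, y_im1: torch.Tensor,
                 y_im2: torch.Tensor) -> torch.Tensor:
    """One Hamming-inspired fusion step over super-resolution features.

    Implements Y_{i+1} = (4/5) Y_i - (9/5) Y_{i-1} + (2/5) Y_{i-2},
    which is equivalent to -(3/5) Y_i + (7/5) dY_i - (2/5) dY_{i-1}
    with dY_i = Y_i - Y_{i-1} (using h = 1, f = Y - X, X_i = Y_{i-1}).
    """
    return 0.8 * y_i - 1.8 * y_im1 + 0.4 * y_im2
```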

3.4. Loss Functions

Due to the significant disparity between the number of target pixels and background pixels in the IRSTD task, there exists a severe class imbalance issue. To address it, we employ the Dice coefficient loss function (Dice Loss), defined as follows:
$L_{Dice} = 1 - \frac{\sum_{n=1}^{N} p_n r_n + \gamma}{\sum_{n=1}^{N} \left(p_n + r_n\right) + \gamma} - \frac{\sum_{n=1}^{N} \left(1 - p_n\right)\left(1 - r_n\right) + \gamma}{\sum_{n=1}^{N} \left(2 - p_n - r_n\right) + \gamma},$
where $p_n$ is the predicted probability that a pixel belongs to the true target, $r_n$ is the true class of the pixel, and $N$ is the total number of pixels. $\gamma$ is a smoothing factor used to prevent the denominators of the loss function from erroneously becoming zero.
Due to the larger gradients at the edges of the targets, Dice Loss is biased toward optimizing target samples, effectively addressing the class imbalance issue. However, if a target occupies only a few pixels and is incorrectly predicted, Dice Loss can change dramatically, affecting the training stability of the network. To address this instability caused by the fluctuation of Dice Loss, we also incorporate the cross-entropy loss function (CE Loss). CE Loss maintains stable performance for both target and background pixels when the network produces false detections on infrared images.
$L_{CE} = -\frac{1}{N}\sum_{n=1}^{N}\left[r_n \log p_n + \left(1 - r_n\right)\log\left(1 - p_n\right)\right].$
When studying the semantic features in infrared super-resolution images, we utilize the commonly used mean squared error loss function (MSE Loss) in deep learning to supervise both the super-resolution images and the infrared images:
$L_{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(X_n - \mathrm{TSB}\left(Y_n\right)\right)^2,$
where $X$ is the input infrared image and $\mathrm{TSB}(\cdot)$ is the super-resolution output. The overall loss of the network combines the semantic segmentation loss weights $\lambda_{Dice}$ and $\lambda_{CE}$ with the super-resolution branch loss weight $\lambda_{MSE}$:
$L_{DCM} = \lambda_{Dice} \cdot L_{Dice} + \lambda_{CE} \cdot L_{CE} + \lambda_{MSE} \cdot L_{MSE}.$
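A minimal sketch of the combined objective; the smoothing factor and the branch weights here are illustrative defaults, not values reported in the paper:

```python
import torch
import torch.nn.functional as F

def tmnet_loss(pred, target, sr_out, hr_ref,
               w_dice=1.0, w_ce=1.0, w_mse=0.1, gamma=1.0):
    """L_DCM = w_dice * L_Dice + w_ce * L_CE + w_mse * L_MSE.

    pred:   segmentation probabilities in [0, 1], shape (B, 1, H, W)
    target: binary ground-truth masks (float), same shape
    sr_out: super-resolution branch output; hr_ref: its reference image
    """
    p, r = pred.flatten(1), target.flatten(1)
    # Two-class soft Dice loss with smoothing factor gamma (equation above)
    fg = ((p * r).sum(1) + gamma) / ((p + r).sum(1) + gamma)
    bg = (((1 - p) * (1 - r)).sum(1) + gamma) / ((2 - p - r).sum(1) + gamma)
    dice = (1 - fg - bg).mean()
    ce = F.binary_cross_entropy(p, r)  # pixel-wise cross entropy
    mse = F.mse_loss(sr_out, hr_ref)   # super-resolution supervision
    return w_dice * dice + w_ce * ce + w_mse * mse
```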

4. Experiment

In this section, we first present the experimental setup, then evaluate the proposed TMNet on the publicly available NUAA-SIRST dataset and compare it with SOTA methods, and finally perform a complete ablation study on the performance of TMNet.

4.1. Experimental Settings

4.1.1. Dataset

Our experimental evaluation is conducted on the publicly available NUAA-SIRST dataset, comprising 427 infrared images with a total of 480 instances. Notably, about 55% of the targets occupy only 0.02% of the image area, often only a few pixels in size. In general, the detection of smaller objects necessitates a larger background context, and the presence of small infrared targets intensifies this difficulty to a significant extent, primarily due to the combination of low contrast and background clutter. Accordingly, this dataset is more challenging for our IRSTD approach. The dataset is divided into three sets, with approximately 50% used for training, 20% for validation, and the remaining 30% for testing.

4.1.2. Evaluation Metrics

For the comparison of the proposed TMNet method with SOTA methods, we utilize the following metrics (a minimal computational sketch follows this list):
  • Intersection over union (IoU): IoU is designed to gauge the precision of detecting the corresponding object within a given dataset. It can be defined as follows:
    $IoU = TP / (T + P - TP),$
    where $T$, $P$, and $TP$ denote the numbers of ground-truth target pixels, predicted target pixels, and correctly predicted (true positive) target pixels, respectively.
  • Normalized intersection over union (nIoU): nIoU is the normalization of IoU, which is a metric specifically designed for IRSTD. It effectively strikes a balance between the structural similarity and pixel accuracy, especially for small infrared targets. It can be calculated as follows:
    $nIoU = \frac{1}{M}\sum_{m=1}^{M} \left(TP_m / \left(T_m + P_m - TP_m\right)\right),$
    where $M$ indicates the total number of targets.
  • Probability of detection ($P_d$): $P_d$ can be computed by dividing the count of correctly predicted targets by the total number of targets, i.e.,
    $P_d = T_{true} / T_{all},$
    where $T_{true}$ and $T_{all}$ stand for the number of accurately detected targets and the total number of targets, respectively.
  • False alarm rate ($F_a$): $F_a$ represents the proportion of falsely predicted target pixels in the infrared image relative to all the pixels present, i.e.,
    $F_a = T_{false} / T_{all},$
    where $T_{false}$ stands for the number of incorrectly detected pixels and $T_{all}$ here denotes the total number of pixels in the image.
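A minimal computational sketch of the four metrics; the criterion for counting a target as "correctly detected" in $P_d$ (e.g., a centroid-distance test) is left abstract here, as it is an evaluator-level choice:

```python
import numpy as np

def iou(pred, gt):
    """IoU = TP / (T + P - TP) over binary masks of identical shape."""
    tp = np.logical_and(pred, gt).sum()
    return tp / (gt.sum() + pred.sum() - tp)

def niou(preds, gts):
    """nIoU: per-target IoU averaged over the M target instances."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def pd_fa(n_detected, n_targets, n_false_pixels, n_pixels):
    """P_d = correctly detected / total targets; F_a = false pixels / all pixels."""
    return n_detected / n_targets, n_false_pixels / n_pixels
```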

4.1.3. Implementation Details

We set the resolution of each image in the NUAA-SIRST dataset to 512 × 512 and apply AdaGrad as the optimizer with a learning rate of 0.01. We set reasonable neural network hyperparameters based on the context and goals of the IRSTD task as well as rules of thumb and cross-validation. The training process spans 2000 epochs, with a batch size of 32 and a weight decay of $10^{-4}$. By default, the threshold value for segmentation is set to 0.5. All models are implemented in PyTorch on a workstation equipped with a CPU clocked at 3.50 GHz and an NVIDIA GeForce RTX 2080Ti GPU. We evaluate the proposed TMNet method against SOTA methods at the pixel level and object level; for the traditional methods, we select Top-Hat [17], Max-Median [12], IPI [36], NRAM [7], WSLCM [15], TLLCM [16], PSTNN [6], RIPT [68], and MSLSTIPT [69] for comparison. For deep learning-based methods, we choose MDvsFA [18], ACMNet [21], ALCNet [22], FC3-Net [28], and APAFNet [33] for comparison.
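For reference, the reported hyperparameters translate into a training loop along these lines; `TMNet`, `NUAASIRST`, the model's output signature, and the `tmnet_loss` helper (sketched in Section 3.4) are placeholders and assumptions rather than the authors' released code:

```python
import torch
from torch.optim import Adagrad
from torch.utils.data import DataLoader

model = TMNet().cuda()                 # placeholder for the actual implementation
optimizer = Adagrad(model.parameters(), lr=0.01, weight_decay=1e-4)
loader = DataLoader(NUAASIRST(size=512), batch_size=32, shuffle=True)

for epoch in range(2000):                           # 2000 training epochs
    for img, mask in loader:
        img, mask = img.cuda(), mask.cuda()
        pred, sr_out = model(img)                   # segmentation + SR outputs
        loss = tmnet_loss(pred, mask, sr_out, img)  # combined objective (Sec. 3.4)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# At test time, predictions are binarized at the default threshold of 0.5.
```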

4.2. Comparison Results with SOTA Methods

4.2.1. Quantitative Results

As illustrated in Table 1, we compare the proposed TMNet model with 14 existing IRSTD methods based on pixel-level and object-level metrics. According to all evaluation metrics on NUAA-SIRST, deep learning-based methods such as MDvsFA, ACMNet, ALCNet, FC3-Net, and APAFNet consistently outperform traditional detection methods, and the proposed TMNet achieves the best detection performance among the deep learning-based methods. In terms of pixel-level metrics (IoU and nIoU), the proposed TMNet demonstrates a powerful ability to extract semantic information features, enabling effective localization and segmentation of infrared targets. Compared to ACMNet, ALCNet, FC3-Net, and APAFNet, the IoU metric improves by 4.8%, 2.8%, 2.3%, and 0.3%, while the nIoU metric improves by 3.9%, 2.8%, 1%, and 0.4%, respectively. These results illustrate that TMNet effectively remedies the loss of semantic features in previous network structures, which impairs the ability to represent the target. In terms of object-level metrics ($P_d$ and $F_a$), since $P_d$ and $F_a$ are two mutually limiting metrics, improving $P_d$ while suppressing $F_a$ is the key to improving network performance. As can be seen from Table 1, the proposed TMNet has better object-level metrics than the other methods. Compared to ACMNet, ALCNet, FC3-Net, and APAFNet, the $P_d$ metric improves by 1.4%, 0.9%, 0.2%, and 0.2%, while the $F_a$ metric is reduced by 3.6%, 13.47%, 1.61%, and 1.16%, respectively. These results indicate that the proposed TMNet effectively improves the IRSTD model’s capability to accurately localize infrared targets and addresses missed and false detections of very small targets by utilizing its rich semantic feature information.

4.2.2. ROC Results

The ROC curve reflects the performance of an IRSTD model at different segmentation thresholds. As shown in Figure 5, we compare the ROC curve of the proposed TMNet with those of two traditional methods and four CNN-based IRSTD detection models. On the NUAA-SIRST dataset, TMNet not only demonstrates excellent performance in evaluation metrics such as IoU, nIoU, $P_d$, and $F_a$ at fixed thresholds but also proves its superiority over other models through the ROC curves.

4.2.3. Visual Results

In Figure 6 and Figure 7, we visualize the results of the proposed TMNet alongside some traditional IRSTD methods and CNN-based deep learning methods for a more intuitive comparison of target recognition. It is evident that the detection results of traditional IRSTD methods, represented by the Top-Hat and IPI methods, are unsatisfactory: they can barely detect the targets in the NUAA-SIRST dataset. Although CNN-based deep learning methods improve the detection results compared with traditional methods, they lose a large amount of target detail information. From the visualizations of the test images, it can be observed that TMNet, compared to other IRSTD models, generates segmentation masks that are closer to the actual shape of the infrared targets. It also achieves accurate target localization and avoids missed detections and false alarms even in complex backgrounds, demonstrating superior feature extraction capabilities.

5. Discussion

In this section, to assess the efficacy of the TMNet model for IRSTD, we remove its key component modules and analyze the results of ablation experiments covering segmentation detection metrics as well as FLOPs and parameter counts. The overall ablation study results are shown in Table 2. Under the same experimental setup, we validate the effectiveness of the AFCE and TSB modules on the NUAA-SIRST dataset. For TMNet w/o AFCE, the feature cross-layer connections are replaced with simple matrix addition when the AFCE module is removed; for w/o TSB, the TSB module is simply removed. As shown in Table 2, after removing the AFCE module, the model’s pixel-level evaluation metrics decrease significantly, indicating that AFCE effectively extracts detailed information from infrared images and greatly impacts the network’s ability to capture the edge shape information of small targets. On the other hand, when the TSB module is removed, the model’s object-level metrics noticeably decrease, demonstrating that TSB helps the network accurately determine target positions in the presence of complex clutter and noise, significantly affecting the network’s localization ability for infrared targets. This indicates that both the AFCE and TSB modules have a significant impact on the performance of the TMNet model.

5.1. Analysis of Attention-Directed Feature Cross-Aggregation Encoder (AFCE)

5.1.1. Analysis of Depth-Weighted Multi-Scale Attention (DMA)

The semantic features of a CNN-based model vary at different depths, and the impact of these features on the model’s performance also differs. To explore the influence of the number of DMA modules on network performance, we conduct ablation experiments with 0, 1, and 2 DMA units. As shown in Table 3, under the same experimental settings, we vary only the number of DMA units in TMNet to study its impact on the model. It can be observed that as the number of DMA units increases, the model achieves better IoU, nIoU, $P_d$, and $F_a$ metrics. This indicates that the DMA module enables the network to obtain richer semantic features and to retain the desired infrared target features within them.

5.1.2. Analysis of Cross-Layer Feature Fusion (CFF)

In the AFCE module, we utilize the CFF module to implement cross-layer semantic feature interaction: specifically, the shallow and deep features from the encoder are feature-mosaicked separately and then transmitted to the corresponding stages of the decoder for feature fusion, thereby improving the retention of valid features and reducing the cross-layer loss of infrared target information.
In Figure 8, to investigate the effectiveness of the interaction between deep and shallow features in the encoding layers, we design two new types of cross-layer interaction modules: the shallow-stop cross-layer feature fusion (SCFF) module and the deep-stop cross-layer feature fusion (DCFF) module. The SCFF module lets the shallow features of the encoding stage be fused directly with the corresponding stage features in the decoding stage, while the deep features of the encoding stage are first embedded with the shallow features and then transmitted to the corresponding decoding stage for feature fusion. On the contrary, the DCFF module restricts the deep features of the encoding stage from undergoing feature embedding and instead integrates them directly with the same-layer features in the decoding stage.
In Table 4, under the same experimental settings, we replace the CFF module in TMNet with SCFF and DCFF for ablation experiments. It can be observed that both pixel-level and object-level evaluation metrics decrease significantly. Hence, whether the shallow features or the deep features of the encoding stage are restricted within the CFF module, the network becomes unable to effectively mitigate cross-layer feature loss, which impacts the model’s performance.

5.2. Analysis of Thermodynamic Super-Resolution Branch (TSB)

The TSB module assists the semantic segmentation network in learning rich features from high-resolution images by utilizing a super-resolution branch that corresponds to each stage of the traditional encoding–decoding structure, enhancing the low-resolution input. In Figure 9, the red dashed lines encircle the actual target segmentation regions in the ground truth (GT). From the comparison of images of the same corresponding stages in the decoder feature maps and TSB feature maps, we observe the following: in the decoder’s multiple infrared feature maps, the target segmentation regions are mixed with non-target segmentation areas, and the segmented target information is not as clear and comprehensive. However, in the multiple feature maps of the corresponding TSB stage, the target segmentation regions and non-target segmentation areas are well-distinguished, and the segmented target information is more distinct and comprehensive. This illustrates that the TSB module can effectively assist the segmentation branch network in learning more comprehensive infrared target information.
In Table 5, under the same experimental settings, we compare the non-corresponding super-resolution branch (NCSB), designed as a one-time operation, with TSB. The NCSB module assists the network in learning by performing a single super-resolution operation to bring the bottom-level features of the network to the same size as the output of the TSB module, all in one go. Through comparison, it is evident that the TSB module possesses stronger feature-assisted extraction capabilities, as it can learn richer semantic features at each stage to aid the network in obtaining more accurate shapes and localization of infrared targets.

6. Conclusions

This paper proposes a novel network, TMNet, for the IRSTD task. From the perspective of multi-scale cross-layer feature fusion, we introduce the AFCE module, which incorporates a novel attention mechanism and multi-scale cross-layer feature interaction mechanism to aid the network in effectively extracting valuable target information. In addition, we also observe that the super-resolution features of infrared images contain detailed information that is not present in low-resolution features. Therefore, we develop the TSB module, which utilizes a super-resolution branch that corresponds to each layer of the semantic segmentation network to assist the model in learning high-resolution details in infrared images. This novel thermodynamically inspired two-path synergistic mechanism combines the Hamming equation with a super-resolution process based on the infrared feature propagation law, which effectively enhances the network’s ability to locate the target information and capture the shape details, and thus improves the performance of the infrared detection model. Extensive experiments on the NUAA-SIRST dataset demonstrate that our proposed TMNet outperforms existing models in terms of objective evaluation metrics and visual quality.

Author Contributions

Conceptualization, X.Z.; Methodology, M.Z.; Software, H.Y.; Writing—original draft, K.Y.; Writing—review & editing, Y.Z.; Supervision, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China under Grants 62272363, 62036007, 62061047, 62176195, and U21A20514, the Young Elite Scientists Sponsorship Program by CAST under Grant 2021QNRC001, and the Youth Talent Promotion Project of Shaanxi University Science and Technology Association under Grant 20200103.

Data Availability Statement

The NUAA-SIRST dataset is freely available at https://github.com/YeRen123455/Infrared-Small-Target-Detection.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Law, W.C.; Xu, Z.; Yong, K.T.; Liu, X.; Swihart, M.T.; Seshadri, M.; Prasad, P.N. Manganese-doped near-infrared emitting nanocrystals for in vivo biomedical imaging. Opt. Express 2016, 24, 17553–17561. [Google Scholar] [CrossRef] [PubMed]
  2. Teutsch, M.; Krüger, W. Classification of small boats in infrared images for maritime surveillance. In Proceedings of the 2010 International WaterSide Security Conference, Carrara, Italy, 3–5 November 2010; IEEE: New York, NY, USA, 2010; pp. 1–7. [Google Scholar]
  3. Zhang, J.; Tao, D. Empowering things with intelligence: A survey of the progress, challenges, and opportunities in artificial intelligence of things. IEEE Internet Things J. 2020, 8, 7789–7817. [Google Scholar] [CrossRef]
  4. Zhang, M.; Wu, Q.; Guo, J.; Li, Y.; Gao, X. Heat transfer-inspired network for image super-resolution reconstruction. IEEE Trans. Neural Netw. Learn. Syst. 2022. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, M.; He, C.; Zhang, J.; Yang, Y.; Peng, X.; Guo, J. SAR-to-Optical Image Translation via Neural Partial Differential Equations. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022; pp. 1644–1650. [Google Scholar]
  6. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382. [Google Scholar] [CrossRef]
  7. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared small target detection via non-convex rank approximation minimization joint l2,1 norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  8. Huang, S.; Liu, Y.; He, Y.; Zhang, T.; Peng, Z. Structure-adaptive clutter suppression for infrared small target detection: Chain-growth filtering. Remote Sens. 2019, 12, 47. [Google Scholar] [CrossRef]
  9. Guan, X.; Zhang, L.; Huang, S.; Peng, Z. Infrared small target detection via non-convex tensor rank surrogate joint local contrast energy. Remote Sens. 2020, 12, 1520. [Google Scholar] [CrossRef]
  10. Zhang, M.; Wu, Q.; Zhang, J.; Gao, X.; Guo, J.; Tao, D. Fluid micelle network for image super-resolution reconstruction. IEEE Trans. Cybern. 2022, 53, 578–591. [Google Scholar] [CrossRef]
  11. Guo, J.; He, C.; Zhang, M.; Li, Y.; Gao, X.; Song, B. Edge-preserving convolutional generative adversarial networks for SAR-to-optical image translation. Remote Sens. 2021, 13, 3575. [Google Scholar] [CrossRef]
  12. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets, Denver, CO, USA, 18–23 July 1999; SPIE: San Francisco, CA, USA, 1999; Volume 3809, pp. 74–83. [Google Scholar]
  13. Zhu, H.; Liu, S.; Deng, L.; Li, Y.; Xiao, F. Infrared small target detection via low-rank tensor completion with top-hat regularization. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1004–1016. [Google Scholar] [CrossRef]
  14. Hou, X.; Zhang, L. Saliency detection: A spectral residual approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; IEEE: New York, NY, USA, 2007; pp. 1–8. [Google Scholar]
  15. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2020, 18, 21078718. [Google Scholar] [CrossRef]
  16. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581.
  17. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recognit. 2010, 43, 2145–2156.
  18. Wang, H.; Zhou, L.; Wang, L. Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8509–8518.
  19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  20. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
  21. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 950–959.
  22. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824.
  23. Zhang, M.; Zhang, R.; Zhang, J.; Guo, J.; Li, Y.; Gao, X. Dim2Clear network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14.
  24. Zhang, M.; Bai, H.; Zhang, J.; Zhang, R.; Wang, C.; Guo, J.; Gao, X. RKformer: Runge-Kutta Transformer with Random-Connection Attention for Infrared Small Target Detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 1730–1738.
  25. He, X.; Mo, Z.; Wang, P.; Liu, Y.; Yang, M.; Cheng, J. ODE-inspired network design for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1732–1741.
  26. Lu, Y.; Zhong, A.; Li, Q.; Dong, B. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3276–3285.
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  28. Zhang, M.; Yue, K.; Zhang, J.; Li, Y.; Gao, X. Exploring Feature Compensation and Cross-level Correlation for Infrared Small Target Detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 1857–1865.
  29. Song, I.; Kim, S. AVILNet: A new pliable network with a novel metric for small-object segmentation and detection in infrared images. Remote Sens. 2021, 13, 555.
  30. Fourure, D.; Emonet, R.; Fromont, E.; Muselet, D.; Tremeau, A.; Wolf, C. Residual conv-deconv grid network for semantic segmentation. arXiv 2017, arXiv:1707.07958.
  31. Li, R.; Shen, Y. YOLOSR-IST: A deep learning method for small target detection in infrared remote sensing images based on super-resolution and YOLO. Signal Process. 2023, 208, 108962.
  32. Hao, Y.; Liu, Y.; Zhao, J.; Yu, C. Dual-Domain Prior-Driven Deep Network for Infrared Small-Target Detection. Remote Sens. 2023, 15, 3827.
  33. Wang, Z.; Yang, J.; Pan, Z.; Liu, Y.; Lei, B.; Hu, Y. APAFNet: Single-Frame Infrared Small Target Detection by Asymmetric Patch Attention Fusion. IEEE Geosci. Remote Sens. Lett. 2022, 20, 1–5.
  34. Hadhoud, M.M.; Thomas, D.W. The two-dimensional adaptive LMS (TDLMS) algorithm. IEEE Trans. Circuits Syst. 1988, 35, 485–494.
  35. Dragomiretskiy, K.; Zosso, D. Two-dimensional variational mode decomposition. In Proceedings of the Energy Minimization Methods in Computer Vision and Pattern Recognition, Hong Kong, China, 13–16 January 2015; Volume 8932, pp. 197–208.
  36. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
  37. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25.
  38. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  39. Zhao, M.; Cheng, L.; Yang, X.; Feng, P.; Liu, L.; Wu, N. TBC-Net: A real-time detector for infrared small target detection using semantic constraint. arXiv 2019, arXiv:2001.05852.
  40. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758.
  41. Zhang, M.; Li, B.; Wang, T.; Bai, H.; Yue, K.; Li, Y. CHFNet: Curvature Half-Level Fusion Network for Single-Frame Infrared Small Target Detection. Remote Sens. 2023, 15, 1573.
  42. Tang, Y.; Wu, X.; Bu, W. Deeply-supervised recurrent convolutional neural network for saliency detection. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 397–401.
  43. Khan, M.M.; Ward, R.D.; Ingleby, M. Classifying pretended and evoked facial expressions of positive and negative affective states using infrared measurement of skin temperature. ACM Trans. Appl. Percept. 2009, 6, 1–22.
  44. Gu, Z.; Zhou, S.; Niu, L.; Zhao, Z.; Zhang, L. Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1921–1929.
  45. Li, Z.; Lang, C.; Liew, J.H.; Li, Y.; Hou, Q.; Feng, J. Cross-layer feature pyramid network for salient object detection. IEEE Trans. Image Process. 2021, 30, 4587–4598.
  46. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
  47. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  48. Li, X.; Zhao, H.; Han, L.; Tong, Y.; Tan, S.; Yang, K. Gated fully fusion for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11418–11425.
  49. Lee, J.; Kim, D.; Ponce, J.; Ham, B. SFNet: Learning object-aware semantic correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2278–2287.
  50. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
  51. Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. OCNet: Object context network for scene parsing. arXiv 2018, arXiv:1809.00916.
  52. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612.
  53. Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9167–9176.
  54. Zhong, Z.; Lin, Z.Q.; Bidart, R.; Hu, X.; Daya, I.B.; Li, Z.; Zheng, W.S.; Li, J.; Wong, A. Squeeze-and-attention networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13065–13074.
  55. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
  56. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2599–2613.
  57. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
  58. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part II; Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407.
  59. Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1664–1673.
  60. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
  61. Wang, Y.; Perazzi, F.; McWilliams, B.; Sorkine-Hornung, A.; Sorkine-Hornung, O.; Schroers, C. A fully progressive approach to single-image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 864–873.
  62. Wang, L.; Li, D.; Zhu, Y.; Tian, L.; Shan, Y. Dual super-resolution learning for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3774–3783.
  63. E, W. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 2017, 5, 1–11.
  64. Hamming, R.W. Stable predictor-corrector methods for ordinary differential equations. J. ACM 1959, 6, 37–47.
  65. Zhang, Y.; Gao, J.; Huang, Z. Hamming method for solving uncertain differential equations. Appl. Math. Comput. 2017, 313, 331–341.
  66. Laine, M.; Vuorinen, A. Basics of Thermal Field Theory; Lecture Notes in Physics; Springer: Cham, Switzerland, 2016; Volume 925.
  67. Romano, G.; Diaco, M.; Barretta, R. Variational formulation of the first principle of continuum thermodynamics. Contin. Mech. Thermodyn. 2010, 22, 177–187.
  68. Dai, Y.; Wu, Y. Reweighted infrared patch-tensor model with both nonlocal and local priors for single-frame small target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767.
  69. Sun, Y.; Yang, J.; An, W. Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3737–3752.
Figure 1. The overall architecture of the proposed TMNet.
Figure 2. Structure of the DMA module.
Figure 3. Structure of the CFF module.
Figure 4. Structure of the TSB module.
Figure 5. ROC curves on the NUAA-SIRST dataset.
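The ROC curves in Figure 5 trace the detection probability $P_d$ against the false-alarm rate $F_a$ as the segmentation threshold is swept. As a hedged illustration of how one such operating point can be computed, the sketch below uses a centroid-distance matching rule; the names `pd_fa_at_threshold`, `score_map`, and `gt_mask`, the tolerance `dist_tol`, and the matching rule itself are assumptions for illustration, not necessarily the paper's exact evaluation protocol.

```python
# Illustrative sketch only: computes the counts behind one (Pd, Fa) point
# for a single image. Aggregate the returned counts over the test set,
# then Pd = hits / targets and Fa = false_pixels / total_pixels.
import numpy as np
from scipy import ndimage

def pd_fa_at_threshold(score_map, gt_mask, thr, dist_tol=3.0):
    """score_map: float array of network outputs; gt_mask: binary array."""
    pred = score_map >= thr
    pred_lbl, n_pred = ndimage.label(pred)
    gt_lbl, n_gt = ndimage.label(gt_mask)
    pred_cents = ndimage.center_of_mass(pred, pred_lbl, range(1, n_pred + 1))
    gt_cents = ndimage.center_of_mass(gt_mask, gt_lbl, range(1, n_gt + 1))
    matched, hits = set(), 0
    for gy, gx in gt_cents:
        for i, (py, px) in enumerate(pred_cents):
            # A target counts as detected if an unmatched predicted
            # component lies within dist_tol pixels of its centroid.
            if i not in matched and np.hypot(gy - py, gx - px) <= dist_tol:
                matched.add(i)
                hits += 1
                break
    # Pixels of unmatched predicted components count as false alarms.
    false_pixels = int(sum((pred_lbl == i + 1).sum()
                           for i in range(n_pred) if i not in matched))
    return hits, n_gt, false_pixels, pred.size
```

Sweeping `thr` over the range of network scores and accumulating these counts yields the full curve.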
Figure 6. Illustrative comparisons of various methods. From top to bottom: the input test images, followed by the outputs of Top-Hat, IPI, MDvsFA, ACMNet, FC3-Net, APAFNet, the proposed TMNet (ours), and the ground truth. The red, yellow, and blue boxes mark correct, missed, and false detections, respectively.
Figure 7. Three-dimensional visualizations of the compared methods. From top to bottom: the input test images, followed by the outputs of Top-Hat, IPI, MDvsFA, ACMNet, FC3-Net, APAFNet, the proposed TMNet (ours), and the ground truth. The vertical axis represents the pixel values of the image, mapped to the range 0–400 for ease of observation.
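The surfaces in Figure 7 render each pixel's grey value as a height. A minimal matplotlib sketch of this kind of visualization follows; `show_surface` is a hypothetical helper, and rescaling intensities to 0–400 simply mirrors the display range stated in the caption.

```python
# Minimal sketch: plot a 2-D intensity map as a 3-D surface, as in Figure 7.
import numpy as np
import matplotlib.pyplot as plt

def show_surface(img):
    """img: 2-D numpy array of pixel intensities."""
    z = img.astype(float)
    z = 400.0 * (z - z.min()) / max(z.max() - z.min(), 1e-6)  # map to 0-400
    y, x = np.mgrid[0:z.shape[0], 0:z.shape[1]]
    ax = plt.figure().add_subplot(projection="3d")
    ax.plot_surface(x, y, z, cmap="jet")
    ax.set_zlim(0, 400)
    plt.show()
```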
Figure 8. (a) Structure of the Shallow stop Cross-layer Feature Fusion (SCFF) module; (b) structure of the Deep stop Cross-layer Feature Fusion (DCFF) module.
Figure 9. Illustrative comparisons of decoder feature maps and TSB feature maps at the corresponding stages. For ease of observation, the location of the target in each image is highlighted with a red circle.
Table 1. Comparison with SOTA methods on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), $P_d$ (%), and $F_a$ ($10^{-6}$). IoU and nIoU are pixel-level metrics; $P_d$ and $F_a$ are object-level metrics. ↑ means higher is better, ↓ means lower is better; bold numbers mark the best result.

| Method | IoU ↑ | nIoU ↑ | $P_d$ ↑ | $F_a$ ↓ |
| --- | --- | --- | --- | --- |
| Top-Hat [17] | 7.14 | 5.20 | 79.8 | 1012 |
| Max-Median [12] | 4.17 | 2.15 | 69.2 | 53.3 |
| IPI [36] | 25.7 | 24.6 | 85.6 | 11.5 |
| NRAM [7] | 12.2 | 10.2 | 74.5 | 13.9 |
| WSLCM [15] | 1.16 | 0.85 | 77.9 | 5446 |
| TLLCM [16] | 1.03 | 0.91 | 79.1 | 5899 |
| PSTNN [6] | 22.4 | 22.4 | 77.9 | 29.1 |
| RIPT [68] | 11.1 | 10.2 | 79.1 | 22.6 |
| MSLSTIPT [69] | 10.3 | 9.58 | 82.1 | 1131 |
| MDvsFA [18] | 60.3 | 58.3 | 89.4 | 56.4 |
| ACMNet [21] | 72.3 | 71.4 | 96.9 | 9.33 |
| ALCNet [22] | 74.3 | 73.1 | 97.4 | 19.2 |
| FC3-Net [28] | 74.8 | 74.3 | 98.1 | 7.34 |
| APAFNet [33] | 76.8 | 74.9 | 98.1 | 6.97 |
| TMNet (ours) | **77.1** | **75.3** | **98.3** | **5.73** |
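The pixel-level columns in Table 1 follow the conventions common in the IRSTD literature: IoU pools intersection and union counts over the entire test set, while nIoU averages the per-image ratio so that a few large targets cannot dominate the score. A minimal sketch under those assumed definitions (`preds` and `gts` are illustrative names for lists of binary masks):

```python
# Hedged sketch of the pixel-level metrics in Table 1; the paper's exact
# definitions are given in its experiments section.
import numpy as np

def iou_and_niou(preds, gts, eps=1e-6):
    """preds, gts: lists of binary numpy arrays of matching shapes."""
    inter_total, union_total, per_image = 0, 0, []
    for p, g in zip(preds, gts):
        p, g = p.astype(bool), g.astype(bool)
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inter_total += inter          # pooled over the whole test set -> IoU
        union_total += union
        per_image.append(inter / (union + eps))  # per-image ratio -> nIoU
    iou = inter_total / (union_total + eps)
    niou = float(np.mean(per_image))
    return 100 * iou, 100 * niou     # both reported in percent
```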
Table 2. Ablation study of TMNet on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), $P_d$ (%), $F_a$ ($10^{-6}$), FLOPs ($10^9$), and number of parameters, Params ($10^6$). ↑ means higher is better, ↓ means lower is better.

| Model | IoU ↑ | nIoU ↑ | $P_d$ ↑ | $F_a$ ↓ | FLOPs | Params |
| --- | --- | --- | --- | --- | --- | --- |
| TMNet | 77.1 | 75.3 | 98.3 | 5.73 | 3.95 | 0.74 |
| w/o AFCE | 75.6 | 73.2 | 96.6 | 21.6 | 3.66 | 0.71 |
| w/o TSB | 76.2 | 72.8 | 96.3 | 16.3 | 2.13 | 0.54 |
| w/o AFCE&TSB | 73.3 | 70.9 | 96.1 | 39.8 | 1.92 | 0.50 |
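The Params column in Table 2 is a straightforward count of learnable weights; the FLOPs column is normally obtained from a profiler. A minimal PyTorch sketch, in which the `TMNet()` constructor, the single-channel 256×256 input, and the choice of the third-party `thop` profiler are all assumptions for illustration:

```python
# Illustrative check of the Params column in Table 2.
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Number of trainable parameters, in millions (10^6)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Example usage (TMNet and the input size are hypothetical here):
# model = TMNet()
# print(f"Params: {count_params_m(model):.2f} M")
# from thop import profile  # third-party profiler, one possible choice
# flops, _ = profile(model, inputs=(torch.randn(1, 1, 256, 256),))
# print(f"FLOPs: {flops / 1e9:.2f} G")
```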
Table 3. Ablation study of the number of DMA modules on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), $P_d$ (%), and $F_a$ ($10^{-6}$). ↑ means higher is better, ↓ means lower is better.

| Number of Blocks | IoU ↑ | nIoU ↑ | $P_d$ ↑ | $F_a$ ↓ |
| --- | --- | --- | --- | --- |
| 0 | 75.6 | 74.1 | 96.9 | 13.4 |
| 1 | 76.1 | 74.8 | 97.1 | 9.43 |
| 2 | 77.1 | 75.3 | 98.3 | 5.73 |
Table 4. Ablation study of CFF on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), $P_d$ (%), and $F_a$ ($10^{-6}$). ↑ means higher is better, ↓ means lower is better.

| Method | IoU ↑ | nIoU ↑ | $P_d$ ↑ | $F_a$ ↓ |
| --- | --- | --- | --- | --- |
| CFF | 77.1 | 75.3 | 98.3 | 5.73 |
| SCFF | 74.3 | 72.4 | 95.4 | 11.2 |
| DCFF | 74.7 | 72.7 | 97.3 | 9.87 |
Table 5. Ablation study of TSB on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), $P_d$ (%), and $F_a$ ($10^{-6}$). ↑ means higher is better, ↓ means lower is better.

| Method | IoU ↑ | nIoU ↑ | $P_d$ ↑ | $F_a$ ↓ |
| --- | --- | --- | --- | --- |
| TSB | 77.1 | 75.3 | 98.3 | 5.73 |
| NCSB | 73.9 | 70.9 | 96.1 | 26.9 |