Thermodynamics-Inspired Multi-Feature Network for Infrared Small Target Detection

Abstract: Infrared small target detection (IRSTD) is widely used in many fields, such as detection and guidance systems, and is of great research importance. However, targets in infrared images are typically small, blurry, feature-poor, and prone to being overwhelmed by noisy backgrounds, posing a significant challenge for IRSTD. In this paper, we propose a thermodynamics-inspired multi-feature network (TMNet) for the IRSTD task, which extracts richer and more essential semantic features of infrared targets through cross-layer and multi-scale feature fusion, along with the assistance of a thermodynamics-inspired super-resolution branch. Specifically, it consists of an attention-directed feature cross-aggregation encoder (AFCE), a U-Net backbone decoder, and a thermodynamic super-resolution branch (TSB). In the shrinkage path, the original encoder structure is reconstructed as AFCE, which contains two depth-weighted multi-scale attention modules (DMA) and a cross-layer feature fusion module (CFF). The DMA and CFF modules achieve self-feature-guided multi-scale feature fusion and cross-layer feature interaction by utilizing semantic features from different stages in the encoding process. In thermodynamics, a difference in heat between particles leads to heat transfer between objects, which inspired us to liken the feature extraction process, in which the network's attention gradually focuses on an infrared target under the constraints of the loss function, to the process of heat transfer. On the expansion path, the TSB module incorporates the Hamming equation of thermodynamics to mine infrared detail features through heat transfer-inspired high-resolution feature representations while assisting the low-resolution branch to learn high-resolution features. We conduct extensive experiments on the publicly available NUAA-SIRST dataset and find that the proposed TMNet exhibits excellent detection performance in both pixel-level and object-level metrics. This discovery provides us with a relatively dependable guideline for formulating network designs aimed at IRSTD.


Introduction
The identification of small targets in infrared images stands as a pivotal technology within the realm of target recognition. Unlike visible imaging mechanisms, infrared imaging can penetrate obstacles and capture more target information in low-light conditions. Accordingly, IRSTD assumes an indispensable role in various domains [1][2][3][4][5], such as detection and guidance systems, early warning systems, and maritime rescue systems. In general, the distance between the object identified by the IRSTD task and the infrared sensor is very long, and the target often occupies merely a few pixels and lacks comprehensive details such as form and texture. This characteristic has been a thorny hurdle in the domain of IRSTD. Furthermore, the energy of infrared radiation decays with increasing imaging distance, greatly reducing the contrast between target feature information and background noise. Therefore, deeply mining the detailed features of infrared targets and minimizing the accuracy loss in image semantic segmentation is crucial to solving this challenge.
To address the challenges of IRSTD, traditional methods [6][7][8][9][10][11] typically treat it as an issue of image filtering and target enhancement. Many innovative methods have been proposed along different directions, such as background spatial consistency [12,13] and target saliency [14][15][16]. Among the filtering methods based on background spatial consistency, max-median/max-mean [12] methods suppress edge information by computing the filtering outcomes along various directions and selecting the maximum value among them. In contrast, top-hat [17] applies a specifically shaped filtering window to traverse the entire image and performs erosion and dilation operations on each pixel to highlight the target. Filtering methods can only suppress relatively simple background clutter, and their performance becomes highly unstable in the presence of complex noise interference. Target saliency methods are influenced by the traits of the human visual system (HVS) and assume that the infrared targets are the most salient objects. For example, the spectral residual method [14] focuses on the variations in the image background and extracts the most prominent parts by leveraging the spectral residual in the image's spectral domain to eliminate the background. The local contrast methods, such as tri-layer local contrast measure (TLLCM) [16] and weighted strengthened local contrast measure (WSLCM) [15], extract targets based on the dissimilarity between the current position and its nearby local neighborhood. While target saliency methods have superior performance in certain scenarios, they perform poorly in low-contrast environments.
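To make the directional filtering idea concrete, the following is a minimal Python sketch of a max-mean background estimate. It assumes four directions (horizontal, vertical, and the two diagonals) and a hypothetical `half_width` window parameter; neither detail is specified in the text above, and the function names are illustrative only.

```python
def max_mean_filter(image, half_width=2):
    """Estimate the background as the maximum of directional means.

    `image` is a 2D list of floats. The four directions and the window
    size are assumptions made for this sketch.
    """
    rows, cols = len(image), len(image[0])
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    background = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            means = []
            for dr, dc in directions:
                # Collect the in-bounds pixels along this direction.
                values = []
                for k in range(-half_width, half_width + 1):
                    rr, cc = r + k * dr, c + k * dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        values.append(image[rr][cc])
                means.append(sum(values) / len(values))
            background[r][c] = max(means)
    return background


def subtract_background(image, background):
    """Return the residual map in which small targets should stand out."""
    return [[p - b for p, b in zip(img_row, bg_row)]
            for img_row, bg_row in zip(image, background)]
```

In this sketch, an isolated bright pixel keeps most of its intensity in the residual, whereas pixels on a long straight bright line are absorbed into the background estimate because the mean along the line's direction stays high.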
The advancement of deep learning has led to an increasing number of proposed methods that aim to improve the accuracy of infrared target detection and overcome performance instability in complex environments. For instance, Miss Detection vs. False Alarm (MDvsFA) [18] divides IRSTD into two sub-tasks, each handled independently by a separate generative adversarial network (GAN) [19] model. To achieve the best detection accuracy, one GAN model focuses on reducing miss detection (MD) and the other on reducing false alarm (FA), thereby reducing both MD and FA in a multi-objective manner. Networks based on the encoder-decoder [20] architecture have achieved impressive results in the IRSTD domain and have been widely applied. For instance, Dai et al. propose adding an asymmetric context modulation (ACM) [21] module on top of the encoder-decoder structure in the neck region. This module is used to fuse low-level semantic information with high-level semantics, thereby avoiding the loss of feature information from infrared images during the encoding process. In addition to proposing a context semantic attention module suitable for infrared tasks, Dai et al. introduce an attentional local contrast network (ALCNet) [22], which combines traditional local contrast measurement methods with feature learning methods through a designed feature mapping cyclic displacement scheme. By leveraging a bottom-up local attention modulation module, ALCNet embeds low-level semantics into high-level semantics. The network attention mechanism designed for the IRSTD task has yielded promising results. Zhang et al. [23] develop the Runge-Kutta transformer (RKformer) [24] method, which employs concepts of the Runge-Kutta equation [25,26] to design a parallel convolution and transformer [27] approach, replacing the conventional encoding process. Furthermore, a cross-level correlation network (FC3-Net) [28] proposed by Zhang et al. utilizes a fine-detail-guided multi-level feature compensation (F-MFC) module and a cross-level feature correlation (CFC) module to not only compensate for the feature loss resulting from the variation in feature map size but also further amplify the network's capacity to locate and represent the shape of the target. Song et al. proposed the amorphous variable inter-located network (AVILNet) [29] built on GridNet [30], which achieves a time-saving, optimally structured network through a multi-scale attention integration module and a unique fusion strategy. YOLOSR-IST [31], proposed by Li et al., effectively mitigates the missed detection and false detection problems of data-driven detection-based methods through super-resolution methods and transformer-based feature blocks. The dual-domain prior-driven deep network (DPDNet) [32] proposed by Hao et al. includes three driver modules: a sparse feature driver module, a high-frequency feature driver module, and a primary detection module, which jointly guide the network to efficiently learn infrared small target features. Furthermore, the asymmetric patch attention fusion network (APAFNet) [33] proposed by Wang et al. achieves more comprehensive semantic information details by modulating high-level semantic information and low-level semantic information in different scenarios through asymmetric patch attention fusion (APAF) modules and expanding context blocks.
However, existing research mainly focuses on feature learning from raw-resolution images and rarely utilizes infrared physical features and phenomena for feature extraction and information interaction, which makes the network prone to losing target feature details, thereby affecting detection accuracy. In thermodynamics, particles with different energies move over time in the same closed environment, with particles of higher heat spontaneously interacting with particles of lower heat. In the study of the IRSTD task, the process by which the IRSTD algorithm learns an infrared target can be analogously perceived as a process in which the infrared features spontaneously pass through the network under the constraints of the loss function and ultimately focus the network's attention entirely on the small target region. Therefore, we link the motion of infrared features in the neural network to the phenomenon of heat transfer and further try to bring the infrared feature segmentation results closer to the real target by using thermodynamic equations. In addition, the similarity between the gradual conversion of an image from low resolution to high resolution by the super-resolution method and the gradual focusing of the network's attention on the infrared target region by IRSTD inspired us to apply the thermodynamic method together with the super-resolution method to the IRSTD task.
For this reason, we propose a thermodynamics-inspired multi-feature network (TMNet), which takes the backbone network of U-Net as the main structure and consists of an attention-directed feature cross-aggregation encoder (AFCE), a U-Net backbone decoder, and a thermodynamic super-resolution branch (TSB). In the design of TMNet, we propose to optimize the whole link of the network, improving the network structure from both top-down and bottom-up aspects. In the top-down path, we reconstruct the original encoder and design AFCE, which comprises a series of cascaded rule residual blocks, a cross-layer feature fusion (CFF) module, and two depth-weighted multi-scale attention (DMA) modules. The input image is received by the DMA module at each level of residual blocks in the encoder path. Then, the DMA module performs a weighted fusion of feature images by an attention mechanism using depth vectors as weights and passes the results to the next level of residual blocks and the CFF module. Subsequently, the CFF module cross-fuses the feature images from the residual blocks at each level and passes them to the decoder. As a result, we effectively extract multi-scale semantic features and enable cross-layer semantic interaction in the encoding-decoding structure, thereby preserving the rich semantic features of infrared images. In the bottom-up path, we add the TSB module, which introduces a thermodynamics-inspired cooperative mechanism to super-resolution images to assist the semantic segmentation operation. The TSB module combines the Hamming equation to extract super-resolution features, which enables each layer's feature map in the decoding stage to be assisted by corresponding super-resolution feature maps for learning. By adding a super-resolution branch loss function, the branch becomes trainable, resulting in better capturing of high-resolution semantic features while preserving low-resolution features. To assess the efficacy of the proposed TMNet, we perform thorough experiments on the publicly available NUAA-SIRST dataset and conclude that TMNet performs better than state-of-the-art (SOTA) methods.
Overall, the contributions of this paper are mainly in three aspects:
1. We introduce an innovative IRSTD model, TMNet, which leverages an innovative super-resolution branch for assisted feature learning and explores and fuses multi-scale features through full-link connections, demonstrating outstanding performance on the NUAA-SIRST dataset.
2. We reconstruct the encoder and propose a new AFCE structure, which utilizes generated depth vectors to induce multi-scale feature image fusion, enabling the comprehensive exploration of spatial detail information features.
3. We introduce a thermodynamics-inspired cooperative mechanism by creating the TSB, which combines the Hamming equation of thermodynamics with super-resolution to enhance the high-resolution representation under low-resolution input.

Infrared Small Target Detection
Existing IRSTD methods can be divided into traditional methods and deep learning-based methods. Traditional methods rely on non-learning or heuristic image processing techniques, approaching the IRSTD problem as an image filtering and target enhancement problem. The traditional methods include filter-based methods such as the top-hat filter (Top-hat) [17], the max-median/max-mean filter [12], the two-dimensional least-mean-square (TDLMS) [34] filter, and the two-dimensional variational mode decomposition (TDVMD) [35] method, as well as target saliency-based methods such as the spectral residual method [14], weighted strengthened local contrast measure (WSLCM) [15], tri-layer local contrast measure (TLLCM) [16], infrared patch-image (IPI) [36], partial sum of the tensor nuclear norm (PSTNN) [6], and non-convex rank approximation minimization (NRAM) [7]. Nevertheless, traditional methods are often limited to specific and simple application scenarios, and when dealing with interference from clutter and noise in complex backgrounds, their performance fluctuates significantly, leading to detection failures as they fail to accurately preserve infrared target features.
In order to improve the robustness of the method and make it adaptable to most complex environments, deep learning models based on convolutional neural networks (CNN) [37,38] have gradually shown excellent performance in IRSTD tasks. Zhao et al. propose a lightweight CNN network called TBC-Net [39], which effectively balances the infrared image targets and background through a joint loss function, semantic modulation module, and target extraction module. Dai et al. develop ALCNet [22], which combines traditional local contrast methods with heuristic approaches to pair easily lost features with deeper features through the cyclic transfer of feature paths. Additionally, they introduce an ACM [21] module that extracts infrared target information from low-level semantic information using attention mechanisms and integrates it with high-level semantics to obtain more effective infrared semantic features in real time. Li et al. present a dense nested attention network (DNA-Net) [40] by building upon the encoding-decoding structure and incorporating densely nested multi-directional feature interaction modules and cascaded feature attention mechanisms. By repeatedly fusing and utilizing features from different time periods, the network effectively harnesses infrared information. Zhao et al. design a miss detection vs. false alarm generative adversarial (MDvsFAGAN) [18] model, which utilizes adversarial learning to suppress miss detection and false alarm. To better utilize manually crafted features in the presence of complex background interference, Zhang et al. [28] propose an F-MFC module and a cross-level feature correlation (CFC) module, which effectively restore the image edge information of the target and reduce the loss of infrared features in the network, thereby preserving more information. Furthermore, they develop a curvature half-level fusion network (CHFNet) [41], a model that extracts image edge information based on the curvature feature and achieves more accurate target extraction by fusing and filtering features from each layer. While current CNN-based models for IRSTD demonstrate excellent performance, most of them focus on incorporating attention mechanisms by enhancing the encoding structure. However, they often overlook the overall architecture of the network, resulting in limited richness and effectiveness in feature extraction.

Cross-Layer Feature Fusion
The semantic segmentation network paradigm based on the encoding-decoding structure has shown excellent performance in segmentation tasks, but its high-level semantic features and low-level semantic features are distributed at the two ends of the network, so popular detection networks [42][43][44] aggregate multiple layers of features [45] through cross-layer feature fusion to improve segmentation performance. Common approaches for cross-layer feature fusion include skip connections, as in U-Net [20], Deeplabv3+ [46], and feature pyramid network (FPN) [47]; gate-based fusion methods, such as gated full fusion network (GFFNet) [48]; and alignment of features across different layers using semantic flow, as in SFNet [49]. However, long-distance feature propagation pathways still result in semantic feature loss and feature mismatch. To guide the long-distance information flow with semantic features from both ends of the encoder-decoder, many networks employ attention mechanisms, such as dual attention network (DANet) [50], object context network (OCNet) [51], criss-cross network (CCNet) [52], expectation-maximization attention network (EMANet) [53], and squeeze-and-attention network (SANet) [54]. In the field of IRSTD, Dai et al. design feature fusion networks, ACMNet [21] and ALCNet [22], based on the interaction between low-level and high-level semantics. Additionally, Zhang et al. introduce cross-layer feature fusion networks such as CHFNet [41] and FC3-Net [28]. However, due to the low signal-to-noise ratio of infrared target features and the presence of background clutter, relying solely on a single scale or a single-side network structure makes it challenging to completely overcome the loss of features for small targets. Consequently, we introduce a cross-layer and multi-scale interactive semantic feature fusion mechanism.

Image Super-Resolution
Image super-resolution is the process of restoring an image from a low resolution to a high resolution. Deep learning-based image super-resolution methods have showcased their excellent performance in various tasks in recent years, and the existing mainstream super-resolution methods can be classified into single upsampling methods [55][56][57][58] and multiple upsampling methods [59][60][61] based on the number of samples. Wang et al. propose a dual super-resolution learning (DSRL) [62] method, which introduces image resolution into segmentation tasks for the first time. DSRL applies super-resolution methods to obtain higher-resolution images to assist in segmenting the network, resulting in more accurate segmentation results. Recently, ordinary differential equation (ODE) methods [25,26,63] have shown great potential in the design of neural networks in deep learning. For example, He et al. [25] design a novel super-resolution network based on the forward Euler method. However, in IRSTD tasks, one-off super-resolution image-assisted segmentation makes it difficult to recover the detailed semantic features of small targets lost during encoding. Inspired by the tendency of thermal particles to always gradually transition from an unstable high-temperature state to a more stable low-temperature state, we draw a parallel between the diffusion process of infrared features in the network and this phenomenon. Therefore, we design a novel super-resolution branch based on the Hamming equation of thermodynamics to assist low-resolution features in learning high-resolution features, which in turn assists the segmentation branch in obtaining more accurate semantic features.

Network Overview
The overall architecture of the TMNet is shown in Figure 1. It comprises the Attention-directed Feature Cross-aggregation Encoder (AFCE), a U-Net backbone decoder, and a Thermodynamic Super-resolution Branch (TSB). Considering the superiority of U-Net in infrared small target detection and semantic segmentation accuracy, our network takes the backbone network of U-Net as the main structure and improves the network structure in both top-down and bottom-up aspects over the whole link. On the one hand, we reconstruct the original encoder in the top-down path and design the AFCE, which comprises a series of cascaded rule residual blocks, a CFF module, and two DMA modules. Each DMA module receives the input information from the corresponding residual block, uses the depth cue information as the weight of the attention mechanism to guide the fusion of multi-scale features, generates the fused feature map, and passes it to the next-level residual block and the CFF module. The CFF module then guides the exchange and fusion of features among the residual blocks at each level to achieve higher-accuracy feature extraction. This process can be defined as follows:

$$F_1 = H_{DMA}(C_1), \quad F_2 = H_{DMA}(C_2),$$
$$(N_1, N_2) = H_{CFF}(F_1, F_2),$$

where $C_1$ and $C_2$ signify the output feature maps of distinct encoding-stage residual blocks, which are fed into the DMA module. $F_1$ and $F_2$ stand for the output feature maps of the DMA module, which are input to the CFF module. $N_1$ and $N_2$ symbolize the output feature maps of the CFF module, which are input into the corresponding decoding-stage residual blocks. $H_{DMA}$ and $H_{CFF}$ stand for the functions of the DMA module and CFF module, respectively. On the other hand, inspired by thermodynamics, we add the proposed TSB to the bottom-up path. It cooperates with the decoder in a two-branch super-resolution framework to assist the semantic segmentation operation.
In conventional encoder structures, the presence of downsampling and pooling layers often causes irreversible loss of detailed features of the image, leaving no way for detailed features to propagate to the deeper layers of the network. Our proposed DMA module is designed to address this problem. In this module, the depth information of the upper-level residual block is used as the weight of the attention mechanism to guide its fusion of feature information at multiple scales, generating a fused feature with rich saliency cues to be passed to the next-level residual block.
The detailed structure of the DMA module is shown in Figure 2. To obtain the depth vector, taking feature $C_1$ as an example, we impose a global average pooling layer and a convolution layer on $C_1$, and then leverage the softmax function to derive $C_{depth}$ to bootstrap the multi-scale features. The formulation is presented as follows:

$$C_{depth} = S\left(Conv\left(GAP(C_1)\right)\right),$$

where $C_{depth}$ denotes the depth vector obtained by processing, $GAP(\cdot)$ denotes the global average pooling layer, and $S(\cdot)$ is the representation of the softmax function. To explore the contextual features of the image at multiple scales, we apply global pooling layers with different expansion rates and different kernel sizes and multiple parallel convolution layers to $C_1$ to generate six multi-scale features $f_m$ $(m = 1, 2, \ldots, 6)$ with identical resolution yet distinct contextual information. This process and detailed parameters can be expressed as follows: where $Conv(\cdot)$, $AConv(\cdot)$, and $APooling(\cdot)$ denote the convolution, atrous convolution, and atrous spatial pyramid pooling layers with different parameters, respectively. $\delta$ and $B$ denote the rectified linear unit (ReLU) and batch normalization (BN), respectively. Among them, we apply atrous convolution instead of stride convolution to extract image features, which has the advantage of reducing information loss and increasing the receptive field under the same computational conditions, explicitly maintaining a high-resolution depth feature representation. Next, the depth vector guides the fusion of these multi-scale features in the form of weights to generate a new feature image. This operation can be defined as follows: Overall, we explore multiple scales of contextual feature images and employ depth cues with rich spatial information to guide the fusion of images, which significantly reduces the detail loss caused by traditional encoders and highlights image detail features.
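The depth-guided fusion described above can be illustrated with a small Python sketch. It assumes that the depth vector holds one softmax weight per scale and that fusion is a weighted sum of the six multi-scale maps; the raw `scores` stand in for the output of the pooling and convolution layers, and all function names are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_weighted_fusion(features, scores):
    """Fuse same-sized feature maps using softmax weights.

    `features` is a list of M maps (each a 2D list of floats) and
    `scores` holds M raw scores. Treating the fusion as a weighted
    sum is an assumption of this sketch.
    """
    weights = softmax(scores)
    rows, cols = len(features[0]), len(features[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for weight, fmap in zip(weights, features):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += weight * fmap[r][c]
    return fused, weights
```

With equal scores every scale contributes equally, so the fused map reduces to the plain average of the inputs; unequal scores shift the result toward the scales that the depth vector favours.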

Cross-Layer Feature Fusion Module (CFF)
In most U-Net-based semantic segmentation methods, designing new connections between residual blocks plays a non-negligible role. In our approach, we improve the encoder and decoder in the original path, which also leads to a large amount of detailed feature information being mined. In order to leverage these features to their fullest extent, we redesign the skip connection by introducing the CFF module. It can aggregate the low-level details and high-level details from every level of the encoder to make up for the corresponding levels of decoder feature images, fully exploring the full-scale information.
The detailed workflow of the CFF module is depicted in Figure 3. In the first step, the feature map $F_1$ from the residual block of the first level of the encoder is applied to the maximum pooling layer and the convolution layer with a 3 × 3 kernel and then passes through the BN and ReLU layers to form a feature map with the same resolution size as the feature map from the residual block of the second level. Subsequently, this feature map is incorporated into the second-stage residual linking path, fused and superimposed with the feature map $F_2$ from the second-stage residual block, and linked to the second-stage residual block layer of the decoder after the deconvolution layer is applied. Similarly, in the second step, the feature maps $F_2$ from the residual blocks of the second level are processed by a series of similar operations to form feature maps with the same resolution size as the feature maps $F_1$ from the residual blocks of the first level and are added to the residual linking path of the first level, except that the maximum pooling layer and the convolution layer in the first step are replaced by the deconvolution layer. This process can be formulated as follows: This innovative linking approach allows full-scale semantic information to be fully mined and exploited to encompass fine-grained details and coarse-grained semantics comprehensively.
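The two-way exchange between levels can be sketched in Python as follows. This is a deliberately simplified illustration: it assumes the two levels differ in resolution by a factor of two, uses element-wise addition for the "fuse and superimpose" step, replaces the deconvolution with nearest-neighbour upsampling, and omits the 3 × 3 convolution, BN, and ReLU layers. All function names are hypothetical.

```python
def max_pool_2x2(fmap):
    """Downsample a 2D list by taking the maximum of each 2x2 block."""
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


def upsample_2x(fmap):
    """Nearest-neighbour upsampling, a stand-in for the deconvolution."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two same-sized 2D lists."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def cross_layer_fusion(f1, f2):
    """Exchange information between a fine map f1 and a coarse map f2."""
    n2 = add_maps(f2, max_pool_2x2(f1))   # fine -> coarse path
    n1 = add_maps(f1, upsample_2x(f2))    # coarse -> fine path
    return n1, n2
```

The point of the sketch is only the data flow: each level's output combines its own features with a resampled copy of the other level's features, so both fine detail and coarse semantics reach the decoder.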

Thermodynamic Super-Resolution Branch (TSB)
In many existing methods, the decoder can only upsample the low-resolution feature maps passed by the encoder to the same size as the input image for analysis, which may lead to the loss of high-resolution information details in the original image and limit the performance of the network structure. We consider the transmission process of infrared features in the network as the movement of thermal particles, extract the features with super-resolution, and further fuse the super-resolution features according to the Hamming equation in thermodynamics.
Therefore, we add the proposed thermodynamics-inspired cooperative super-resolution module on top of the original decoder structure to solve the above dilemma. We follow a two-branch design and introduce a cooperative mechanism to maintain the high-resolution representation in the presence of low-resolution inputs. In the semantic segmentation branch, we enhance network performance and information utilization by upsampling the prediction masks during both training and testing, effectively utilizing valid label information. This approach outperforms the classical decoder structure, while the added upsampling module has fewer parameters, leading to a substantial reduction in computational complexity. In the super-resolution branch, the fine-grained structural information in the input low-resolution feature maps is reconstructed and guided by feature affinity learning to bring additional high-resolution detail features to the decoder, enhancing the high-resolution representation of semantic segmentation. As depicted in Figure 4, the super-resolution auxiliary branch starts from the end of the encoder and its process can be represented as follows: where $X_{bottom}$ represents the bottom-level feature input from the encoding part to the decoding part, and $X_{1\_SR}$, $X_{2\_SR}$, and $X_{3\_SR}$ are the 2× super-resolution feature maps corresponding to different stages $X_1$, $X_2$, and $X_3$ of the decoding part. The super-resolution block (SR block) consists of a residual structure of upsampling layers and convolutional layers, which ensures the learnability of the auxiliary branch for infrared targets and helps eliminate the aliasing effects caused by the upsampling process. In this way, when the network is learning target features, each decoding layer has its corresponding auxiliary super-resolution feature map to help capture the detailed features lost in the single-branch low-resolution decoding structure.
Numerical methods represented by the Hamming equation [64,65] have a wide range of applications in heat transfer problems [66,67]; they discretize continuous thermodynamic problems into step-by-step calculations at discrete times. To enhance the feature learning of the segmentation branch by leveraging the features from the super-resolution branch, we draw inspiration from the Hamming method of thermodynamics and conduct feature fusion on the super-resolution features. We use post-super-resolution infrared image features and process the features using convolution and ReLU operations to simulate the features in discrete time, which can be combined with the Hamming equation to further simulate the real target. The Hamming method formula is given by where $f$ denotes the rate of change of the infrared feature in discrete time. In order to utilize the existing super-resolution infrared feature $X_{3\_SR}$ to simulate the rate of change of the infrared feature under discrete time, we define $f = Y - X$, where $X$ and $Y$ can be regarded as the input and output of a learning module consisting of two convolutional layers with a kernel size of 3 × 3 and a ReLU layer, such that $X_i = Y_{i-1}$. In addition, where $h$ represents the spatial step size, we set $h = 1$ to take into account the stability and information loss problems of modeling super-resolution features, as well as to improve the readability of the module: By further utilizing $Y_i$, $Y_{i-1}$, and $Y_{i-2}$, we can eliminate $X_{i+1}$, $X_i$, and $X_{i-1}$ to obtain Let $\Delta Y_i$ represent the residual between $Y_i$ and its input $X_i = Y_{i-1}$. The above equation establishes the relationship between $\Delta Y_{i-2}$, $\Delta Y_{i-1}$, $\Delta Y_i$, $\Delta Y_{i+1}$ and the final output $\Delta Y_{i+1}$ with the input image $X$, which is subjected to the mean squared error loss to enhance the feature representation.
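For orientation, the elimination steps described above can be worked through under one explicit assumption: that the "Hamming method formula" refers to the standard corrector of Hamming's predictor-corrector scheme for $y' = f$. This is a sketch under that assumption, not necessarily the exact form or indexing used in the paper (the text also mentions $\Delta Y_{i-2}$, which does not appear in this version).

```latex
% Assumed starting point: the standard Hamming corrector with step size h.
\begin{align}
Y_{i+1} &= \tfrac{1}{8}\left(9Y_i - Y_{i-2}\right)
          + \tfrac{3h}{8}\left(f_{i+1} + 2f_i - f_{i-1}\right) \\
% Set h = 1 and substitute f_k = Y_k - X_k.
Y_{i+1} &= \tfrac{1}{8}\left(9Y_i - Y_{i-2}\right)
          + \tfrac{3}{8}\left[(Y_{i+1} - X_{i+1}) + 2(Y_i - X_i) - (Y_{i-1} - X_{i-1})\right] \\
% Eliminate X_{i+1}, X_i, X_{i-1} using X_k = Y_{k-1}.
Y_{i+1} &= \tfrac{1}{8}\left(9Y_i - Y_{i-2}\right)
          + \tfrac{3}{8}\left[(Y_{i+1} - Y_i) + 2(Y_i - Y_{i-1}) - (Y_{i-1} - Y_{i-2})\right] \\
% Write the residuals as \Delta Y_k = Y_k - Y_{k-1}.
Y_{i+1} &= \tfrac{1}{8}\left(9Y_i - Y_{i-2}\right)
          + \tfrac{3}{8}\left(\Delta Y_{i+1} + 2\,\Delta Y_i - \Delta Y_{i-1}\right)
\end{align}
```

Under this reading, each $\Delta Y_k$ is the residual learned by one convolution-ReLU module, so the fused output is a fixed linear combination of earlier outputs and successive learned residuals.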

Loss Functions
Due to the significant disparity in the number of pixels between the target and the background in the IRSTD task, there exists a severe class imbalance issue. In response, we employ the Dice coefficient loss function (Dice Loss), defined as follows:

$$L_{Dice} = 1 - \frac{2\sum_{n=1}^{N} p_n r_n + \gamma}{\sum_{n=1}^{N} p_n + \sum_{n=1}^{N} r_n + \gamma},$$

where $p_n$ is the probability of a pixel value belonging to the true target, $r_n$ represents the true class to which the pixel value belongs, and $N$ represents the pixel number. $\gamma$ is a smoothing factor that is used to prevent the denominator of the loss function from becoming zero. Due to the larger gradients at the edges of the targets, Dice Loss is biased toward optimizing target samples, effectively addressing the class imbalance issue. However, if the target contains very few pixels and they are incorrectly predicted, Dice Loss can change significantly, thus affecting the training effectiveness of the network. To address the instability caused by the fluctuation of Dice Loss, we also incorporate the cross-entropy loss function (CE Loss). The CE Loss maintains stable performance for both target and background pixels when the network experiences false detections on infrared images.
When studying the semantic features in infrared super-resolution images, we utilize the mean squared error loss function (MSE Loss), commonly used in deep learning, to supervise both the super-resolution images and the infrared images:

$$L_{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(TSB(X)_n - X_n\right)^2,$$

where $X$ is the input infrared image and $TSB(\cdot)$ is the super-resolution output. The loss function used by the network can be represented by the semantic segmentation loss weights, $\lambda_{Dice}$ and $\lambda_{CE}$, and the super-resolution branch loss weight, $\lambda_{MSE}$:

$$L = \lambda_{Dice} L_{Dice} + \lambda_{CE} L_{CE} + \lambda_{MSE} L_{MSE}.$$
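A minimal Python sketch of how the three loss terms might be combined is given below. It assumes the common smoothed form of Dice Loss, a binary cross-entropy over per-pixel probabilities, and placeholder weights of 1.0; the paper's actual weight values are not stated in this section, and the function names are hypothetical.

```python
import math


def dice_loss(probs, labels, gamma=1.0):
    """Smoothed Dice loss over flattened pixels (assumed common form)."""
    intersection = sum(p * r for p, r in zip(probs, labels))
    return 1.0 - (2.0 * intersection + gamma) / (sum(probs) + sum(labels) + gamma)


def cross_entropy_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, r in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total / len(probs)


def mse_loss(prediction, target):
    """Mean squared error between two flattened images."""
    return sum((a - b) ** 2 for a, b in zip(prediction, target)) / len(prediction)


def total_loss(probs, labels, sr_output, image,
               w_dice=1.0, w_ce=1.0, w_mse=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_dice * dice_loss(probs, labels)
            + w_ce * cross_entropy_loss(probs, labels)
            + w_mse * mse_loss(sr_output, image))
```

The sketch also shows why the Dice term alone can be unstable for tiny targets: with a single target pixel, one misplaced prediction moves the Dice value from 0 to roughly 2/3, while the averaged cross-entropy term changes far more gradually per pixel.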

Experiment
In this section, we begin by presenting the experimental setup, then evaluate the proposed TMNet on the publicly available NUAA-SIRST dataset and compare it with SOTA methods, and finally perform a complete ablation study of TMNet.

Dataset
Our experimental evaluation is conducted on the publicly available NUAA-SIRST dataset, comprising 427 infrared images with a total of 480 instances. Notably, about 55% of the targets occupy only 0.02% of the image area, often only a few pixels in size. In general, the detection of smaller objects necessitates a larger background context, and small infrared targets intensify this difficulty significantly, primarily due to the combination of low contrast and background clutter. Accordingly, this dataset is challenging for our IRSTD approach. The dataset is divided into three sets, with approximately 50% used for training, 20% for validation, and the remaining 30% for testing.
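A 50/20/30 partition of the 427 images can be sketched as below. The random shuffle, the seed, the rounding rule, and the function name are all assumptions made for illustration; the text only states approximate proportions, so the resulting set sizes need not match the ones actually used.

```python
import random

def split_indices(n_items, train_frac=0.5, val_frac=0.2, seed=0):
    """Shuffle item indices and cut them into train/val/test partitions.

    The test set receives whatever remains after train and val are taken.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = int(n_items * train_frac)
    n_val = int(n_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_ids, val_ids, test_ids = split_indices(427)
```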

Evaluation Metrics
For the comparison of the proposed TMNet method with SOTA methods, we utilize the following metrics:

1. Intersection over union (IoU): IoU is designed to gauge the precision of detecting the corresponding object within a given dataset. It can be defined as follows:

$$\mathrm{IoU} = \frac{TP}{T + P - TP}$$

where T, P, and TP represent true, positive, and true positive target messages, respectively.

2. Normalized intersection over union (nIoU): nIoU is the normalization of IoU and is a metric specifically designed for IRSTD. It strikes a balance between structural similarity and pixel accuracy, especially for small infrared targets. It can be calculated as follows:

$$\mathrm{nIoU} = \frac{1}{M}\sum_{i=1}^{M}\frac{TP[i]}{T[i] + P[i] - TP[i]}$$

where M indicates the overall target messages.

3. Probability of detection (P_d): P_d is computed by dividing the count of correctly predicted targets by the total number of targets, i.e.,

$$P_d = \frac{T_{true}}{T_{all}}$$

where T_true and T_all stand for the number of accurately detected targets and the total number of targets, respectively.

4. False alarm rate (F_a): F_a represents the proportion of falsely predicted target pixels in the infrared image relative to all the pixels present, i.e., F_a is T_false divided by the total number of pixels in the image, where T_false stands for the number of incorrectly detected pixels.
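The four metrics above can be sketched on binary masks stored as nested lists of 0/1 values. Several choices here are assumptions rather than details from the text: a ground-truth target counts as detected when any predicted pixel overlaps it, targets are 8-connected components, and empty masks score 1.0. All function names are hypothetical.

```python
from collections import deque

def pixel_counts(pred, gt):
    """Return (TP, T, P): true positive, ground-truth, and predicted pixel counts."""
    tp = sum(p * g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    t = sum(g for row in gt for g in row)
    p = sum(v for row in pred for v in row)
    return tp, t, p

def iou(pred, gt):
    """IoU = TP / (T + P - TP) for one pair of masks."""
    tp, t, p = pixel_counts(pred, gt)
    union = t + p - tp
    return tp / union if union else 1.0

def niou(preds, gts):
    """Average of the per-sample IoU values."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)

def gt_targets(gt):
    """Collect each 8-connected ground-truth component as a list of (row, col)."""
    h, w = len(gt), len(gt[0])
    seen = [[False] * w for _ in range(h)]
    targets = []
    for r in range(h):
        for c in range(w):
            if gt[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, comp = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and gt[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                targets.append(comp)
    return targets

def detection_probability(pred, gt):
    """P_d = detected targets / all targets, using the overlap criterion."""
    targets = gt_targets(gt)
    hits = sum(1 for comp in targets if any(pred[y][x] for y, x in comp))
    return hits / len(targets) if targets else 1.0

def false_alarm_rate(pred, gt):
    """F_a = falsely predicted pixels / all pixels in the image."""
    false_px = sum(1 for pr, gr in zip(pred, gt)
                   for p, g in zip(pr, gr) if p and not g)
    return false_px / sum(len(row) for row in pred)

gt_mask = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
pred_mask = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```

In this toy pair, the prediction partly covers one of the two ground-truth targets and adds one spurious pixel, so the pixel-level and object-level metrics respond to different errors.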

Implementation Details
We resize each image in the NUAA-SIRST dataset to 512 × 512 and apply AdaGrad as the optimizer with a learning rate of 0.01. We set the neural network hyperparameters based on the context and goals of the IRSTD task as well as rules of thumb and cross-validation. The training process spans 2000 epochs, with a batch size of 32 and a weight decay of 10^−4. By default, the threshold value for segmentation is set to 0.5. All models are implemented in PyTorch on a workstation equipped with a CPU clocked at 3.50 GHz and an NVIDIA GeForce RTX 2080Ti GPU. We compare the proposed TMNet method with SOTA methods at the pixel level and the object level. For the traditional methods, we select Top-Hat [17], Max-Median [12], IPI [36], NRAM [7], WSLCM [15], TLLCM [16], PSTNN [6], RIPT [68], and MSLSTIPT [69] for comparison. For deep learning-based methods, we choose MDvsFA [18], ACMNet [21], ALCNet [22], FC3-Net [28], and APAFNet [33] for comparison.
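The stated settings can be gathered into a small configuration object, sketched below in plain Python. The class and field names are hypothetical, and the strict `>` comparison used for thresholding is an assumption. In PyTorch, the optimizer settings would typically map onto `torch.optim.Adagrad` with its `lr` and `weight_decay` arguments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text; names are illustrative only."""
    image_size: int = 512
    optimizer: str = "AdaGrad"
    learning_rate: float = 0.01
    weight_decay: float = 1e-4
    epochs: int = 2000
    batch_size: int = 32
    seg_threshold: float = 0.5

def binarize(prob_map, threshold):
    """Turn a probability map into a binary mask (pixel is target if p > threshold)."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]

cfg = TrainConfig()
mask = binarize([[0.2, 0.7], [0.5, 0.51]], cfg.seg_threshold)
```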

Quantitative Results
As illustrated in Table 1, we compare the proposed TMNet model with 14 existing IRSTD methods on pixel-level and object-level metrics. Across all evaluation metrics on NUAA-SIRST, deep learning-based methods such as MDvsFA, ACMNet, ALCNet, FC3-Net, and APAFNet consistently outperform traditional detection methods, with the proposed TMNet achieving the best detection performance among the deep learning-based methods. In terms of pixel-level metrics (IoU and nIoU), the proposed TMNet demonstrates a strong ability to extract semantic features, enabling effective localization and segmentation of infrared targets. Compared to ACMNet, ALCNet, FC3-Net, and APAFNet, the IoU improves by 4.8%, 2.8%, 2.3%, and 0.3%, while the nIoU improves by 3.9%, 2.8%, 1%, and 0.4%, respectively. These metrics illustrate that TMNet effectively improves on previous network structures, in which the loss of semantic features weakens the ability to represent the target. In terms of object-level metrics (P_d and F_a), since P_d and F_a are mutually limiting, improving P_d while suppressing F_a is the key to improving network performance. As can be seen from Table 1, the proposed TMNet achieves better object-level metrics than the other methods. Compared to ACMNet, ALCNet, FC3-Net, and APAFNet, the P_d improves by 1.4%, 0.9%, 0.2%, and 0.2%, while the F_a is reduced by 3.6%, 13.47%, 1.61%, and 1.16%, respectively. These results indicate that the proposed TMNet effectively improves the IRSTD model's capability to accurately localize infrared targets and, by utilizing its rich semantic feature information, addresses missed detections and false detections for very small targets.
Table 1. Comparison with SOTA methods on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), P_d (%), and F_a (10^−6). ↑ indicates that higher values of the metric are better, while ↓ indicates that lower values are better. Bold numbers mark the best result for each metric.

ROC Results
The ROC curve reflects the performance of the IRSTD model across different segmentation thresholds. As shown in Figure 5, we compare the ROC curve of the proposed TMNet with those of two traditional methods and four CNN-based IRSTD detection models. On the NUAA-SIRST dataset, TMNet not only demonstrates excellent performance in evaluation metrics such as IoU, nIoU, P_d, and F_a at fixed thresholds but also proves its superiority over other models through its ROC curve.
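How such a curve is traced can be sketched by sweeping the segmentation threshold over a predicted probability map. This is a simplified, pixel-level illustration: the true positive rate here is computed over pixels as a stand-in for the object-level P_d, which is an assumption made to keep the example short. The function name is hypothetical.

```python
def roc_points(prob_map, gt, thresholds):
    """Return one (false_alarm_rate, true_positive_rate) pair per threshold."""
    flat_p = [v for row in prob_map for v in row]
    flat_g = [v for row in gt for v in row]
    n_pos = sum(flat_g)
    points = []
    for t in thresholds:
        pred = [1 if v > t else 0 for v in flat_p]
        tp = sum(p * g for p, g in zip(pred, flat_g))        # correctly predicted target pixels
        fp = sum(p * (1 - g) for p, g in zip(pred, flat_g))  # falsely predicted target pixels
        tpr = tp / n_pos if n_pos else 0.0
        points.append((fp / len(flat_p), tpr))
    return points

curve = roc_points([[0.9, 0.6], [0.4, 0.1]], [[1, 0], [1, 0]], [0.0, 0.5, 0.95])
```

Lowering the threshold moves along the curve toward higher detection and higher false alarm rates, which is why a single fixed threshold such as 0.5 shows only one operating point of the model.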

Visual Results
In Figures 6 and 7, we visualize the results of the proposed TMNet alongside some traditional IRSTD methods and CNN-based deep learning methods for a more intuitive representation of the target recognition results. The detection results of the traditional IRSTD methods, represented by Top-Hat and IPI, are not satisfactory, and they can barely detect the targets on the NUAA-SIRST dataset. Although CNN-based deep learning methods improve the detection results compared with traditional methods, they lose a large amount of target detail information. From the visualizations of the test images, it can be observed that TMNet, compared to other IRSTD models, generates segmentation masks that are closer to the actual shape of the infrared targets. It also achieves accurate target localization and avoids missed detections and false alarms even in complex backgrounds, demonstrating superior feature extraction capabilities.

Discussion
In this section, to assess the efficacy of the TMNet model for IRSTD, we remove the key component modules from TMNet and analyze the resulting ablation experiments, which report segmentation and detection metrics as well as FLOPs and parameter counts. The overall ablation results are shown in Table 2. In the same experimental setup, we validate the effectiveness of the AFCE and TSB modules on the NUAA-SIRST dataset. For TMNet w/o AFCE, the AFCE module is removed and the cross-layer feature connections are replaced with simple matrix addition. For TMNet w/o TSB, the TSB module is simply removed. As shown in Table 2, after removing the AFCE module, the model's pixel-level evaluation metrics decrease significantly, indicating that AFCE can effectively extract detailed information from infrared images and greatly impacts the network's ability to capture edge shape information of small targets. On the other hand, when the TSB module is removed, the model's object-level metrics decrease noticeably, demonstrating that TSB assists the network in accurately determining the position of targets in the presence of complex clutter and noise and significantly affects the network's localization ability for infrared targets. This indicates that both the AFCE and TSB modules have a significant impact on the performance of the TMNet model.
The semantic features of a CNN-based model vary at different depths, and the impact of these features on the model's performance also differs. To explore the influence of the number of DMA units on network performance, we conduct ablation experiments with the number of DMA units set to 0, 1, and 2.
As shown in Table 3, under the same experimental settings, we vary only the number of DMA units in TMNet to study its impact on the model. As the number of DMA units increases, the model achieves better IoU, nIoU, P_d, and F_a metrics. This indicates that the DMA module enables the network to obtain richer semantic features and to retain the desired infrared target features within the enriched semantic features. In the AFCE module, we utilize the CFF module to implement cross-layer semantic feature interaction; specifically, the shallow and deep features from the encoder are feature-mosaicked separately and then transmitted to the corresponding stage in the decoder for feature fusion, thus improving the retention of valid features across layers and reducing the cross-layer loss of infrared target information.
In Figure 8, to investigate the effectiveness of the interaction between deep features and shallow features in the encoding layers, we design two new types of cross-layer interaction modules: the Shallow stop Cross-layer Feature Fusion module (SCFF) and the Deep stop Cross-layer Feature Fusion module (DCFF). The SCFF module allows the shallow features in the encoding stage to preserve their feature fusion with the corresponding stage features in the decoding stage. Meanwhile, the deep features in the encoding stage are first embedded with the shallow features and then transmitted to the corresponding stage in the decoding stage for feature fusion. In contrast, the DCFF module restricts the deep features in the encoding stage from undergoing feature embedding and instead directly integrates them with the same-layer features in the decoding stage.
In Table 4, under the same experimental settings, we replace the CFF module in TMNet with SCFF and DCFF for ablation experiments. Both the pixel-level and the object-level evaluation metrics decrease significantly. Hence, restricting either the shallow features or the deep features in the encoding stage within the CFF module leaves the network unable to effectively mitigate the feature loss across layers, which degrades the model's performance.

Analysis on Thermodynamic Super-Resolution Branch (TSB)
The TSB module assists the semantic segmentation network in learning rich features from high-resolution images by utilizing a super-resolution branch that corresponds to each stage of the traditional encoding-decoding structure, enhancing the low-resolution input. In Figure 9, the red dashed lines encircle the actual target segmentation regions in the ground truth (GT). Comparing the decoder feature maps and the TSB feature maps at the same corresponding stages, we observe the following: in the decoder's infrared feature maps, the target segmentation regions are mixed with non-target areas, and the segmented target information is not as clear or comprehensive. In the feature maps of the corresponding TSB stage, however, the target segmentation regions and non-target areas are well distinguished, and the segmented target information is more distinct and comprehensive. This illustrates that the TSB module can effectively assist the segmentation branch in learning more comprehensive infrared target information.
In Table 5, under the same experimental settings, we compare TSB with the non-corresponding super-resolution branch (NCSB), which is designed as a one-time operation. The NCSB module assists the network in learning by performing a single super-resolution operation that brings the bottom-level features of the network to the same size as the output of the TSB module in one step. The comparison shows that the TSB module possesses stronger feature-assisted extraction capabilities, as it can learn richer semantic features at each stage to help the network obtain more accurate shapes and localization of infrared targets.

Figure 9. Illustrative comparisons of decoder feature maps and TSB feature maps for the same corresponding stages (panels: IR image, decoder, TSB, GT). For ease of observation, we have highlighted the location of the target in the image with a red circle.
Table 5. Ablation study of TSB on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), P_d (%), and F_a (10^−6). ↑ indicates that higher values of the metric are better, while ↓ indicates that lower values are better.

Conclusions
This paper proposes a novel network, TMNet, for the IRSTD task. From the perspective of multi-scale cross-layer feature fusion, we introduce the AFCE module, which incorporates a novel attention mechanism and a multi-scale cross-layer feature interaction mechanism to help the network effectively extract valuable target information. In addition, we observe that the super-resolution features of infrared images contain detail that is not present in low-resolution features. Therefore, we develop the TSB module, which utilizes a super-resolution branch that corresponds to each layer of the semantic segmentation network to assist the model in learning high-resolution details in infrared images. This novel thermodynamically inspired two-path synergistic mechanism combines the Hamming equation with a super-resolution process based on the infrared feature propagation law, which effectively enhances the network's ability to locate the target and capture shape details, and thus improves the performance of the infrared detection model. Extensive experiments on the NUAA-SIRST dataset demonstrate that our proposed TMNet outperforms existing models in terms of objective evaluation metrics and visual quality.

Figure 1. The overall architecture of the proposed TMNet.

Figure 6. Illustrative comparisons of various methods. From top to bottom: the input test images, followed by the outputs of Top-Hat, IPI, MDvsFA, ACMNet, FC3-Net, APAFNet, the proposed TMNet (ours), and the ground truth, respectively. The boxes in red, yellow, and blue represent correct, missed, and false detections, respectively.

Figure 7. Three-dimensional visualization of method comparisons. From top to bottom: the input test images, followed by the outputs of Top-Hat, IPI, MDvsFA, ACMNet, FC3-Net, APAFNet, the proposed TMNet (ours), and the ground truth, respectively. The vertical axis of the three-dimensional visualization results represents the pixel values of the image. For ease of observation, we have mapped it to a range of 0-400.

Table 3. Ablation study of the number of DMA modules on the NUAA-SIRST dataset in terms of IoU (%), nIoU (%), P_d (%), and F_a (10^−6). ↑ indicates that higher values of the metric are better, while ↓ indicates that lower values are better.